# Butterfly: Robust One-step Approach towards Wildly-unsupervised Domain Adaptation

###### Abstract

Unsupervised domain adaptation (UDA) trains with clean labeled data in source domain and unlabeled data in target domain to classify target-domain data. However, in real-world scenarios, it is hard to acquire fully-clean labeled data in source domain due to the expensive labeling cost. This brings us a new but practical adaptation called wildly-unsupervised domain adaptation (WUDA), which aims to transfer knowledge from noisy labeled data in source domain to unlabeled data in target domain. To tackle WUDA, we present a robust one-step approach called Butterfly, which trains four networks. Specifically, two networks are jointly trained on noisy labeled data in source domain and pseudo-labeled data in target domain (i.e., data in mixture domain). Meanwhile, the other two networks are trained on pseudo-labeled data in target domain. By using the dual-checking principle, Butterfly can obtain high-quality target-specific representations. We conduct experiments to demonstrate that Butterfly significantly outperforms other baselines on simulated and real-world WUDA tasks in most cases.

Preprint. Work in progress.

## 1 Introduction

Domain adaptation (DA) aims to learn a discriminative classifier in the presence of a shift between training data in source domain and test data in target domain [2, 4, 27]. Currently, DA can be divided into three categories: supervised DA (SDA) [25], semi-supervised DA (SSDA) [10] and unsupervised DA (UDA) [20]. When the number of labeled data is few in target domain, SDA is also known as few-shot DA [19]. Since unlabeled data in target domain can be easily obtained, UDA methods have great potential in the real-world applications [4, 5, 7, 8, 17, 20, 21].

UDA methods train with clean labeled data in source domain and unlabeled data in target domain to classify target-domain data. They mainly rely on three orthogonal techniques: integral probability metrics (IPM) [6, 8, 9, 14, 17], adversarial training [5, 12, 21, 26] and pseudo labeling [20]. Compared to IPM- and adversarial-training methods, the pseudo-labeling method (i.e., asymmetric tri-training domain adaptation (ATDA) [20]) can construct a high-quality target-specific representation, which provides a better classification performance on the target domain. Besides, the pseudo-labeling method has been theoretically justified [20].

However, in real-world scenarios, the data volume of source domain tends to be large. To avoid the expensive labeling cost, labeled data in source domain normally comes from amateur annotators or the Internet [15, 22, 24]. This brings us a new adaptation scenario termed wildly-unsupervised domain adaptation (abbreviated as WUDA, Figure 1). This adaptation aims to transfer knowledge from noisy data in source domain to unlabeled data in target domain. Unfortunately, current UDA methods share an implicit assumption that there is no noisy data in source domain. Namely, these methods focus on transferring knowledge from clean data in source domain to unlabeled data in target domain. Therefore, these methods cannot handle WUDA well.

In this paper, we theoretically reveal the deficiency of current UDA methods. To improve these methods, a straightforward strategy is a two-step approach. In Figure 1, we can first use label-noise algorithms to train a model on noisy data in source domain, then leverage this trained model to assign pseudo labels for source domain. Via UDA methods, we can transfer knowledge from pseudo-labeled data in source domain to unlabeled data in target domain. Nonetheless, this pseudo-labeled data is still noisy, and such a two-step strategy may relieve but cannot eliminate noise effects.

To circumvent the issue of the two-step approach, under the theoretical guidance, we present a robust one-step approach called Butterfly. At a high level, Butterfly directly transfers knowledge from noisy source data to unlabeled target data, and uses the transferred knowledge to construct target-specific representations. At a low level, Butterfly trains four networks divided into two branches (Figure 2): two networks in Branch-I are jointly trained on noisy data in source domain and pseudo-labeled data in target domain (data in mixture domain), while two networks in Branch-II are trained on pseudo-labeled data in target domain.

Butterfly is robust because of the dual-checking principle: it checks high-correctness data out from not only the mixture domain but also the pseudo-labeled target domain. After cross-propagating these high-correctness data, Butterfly can obtain high-quality domain-invariant representations (DIR) and target-specific representations (TSR) simultaneously in an iterative manner. If we only check data in the mixture domain (i.e., single checking), the errors in pseudo-labeled target data will accumulate, leading to low-quality DIR and TSR.

We conduct experiments on simulated WUDA tasks, including MNIST-to-SYND tasks, SYND-to-MNIST tasks and human-sentiment tasks. Besides, we conduct experiments on real-world WUDA tasks. Empirical results demonstrate that Butterfly can robustly transfer knowledge from noisy data in source domain to unlabeled data in target domain. Meanwhile, Butterfly performs much better than current UDA methods when source domain suffers from extreme (e.g., 45%) noise.

## 2 Wildly-unsupervised domain adaptation

Here, we first describe the prerequisites and the new setting in domain adaptation. Then, we analyze why current UDA methods cannot handle this new setting well.

### 2.1 Prerequisites

We use the following notations in this section: 1) is a topological space and is a label set; 2) , and represent densities of noisy, correct and incorrect multivariate random variables (m.r.v.) defined on , respectively, and , and are their marginal densities (there are two common ways to express the density of a noisy m.r.v., reviewed in Appendix 0.A; one way is to use a mixture of densities of correct and incorrect m.r.v.); 3) represents density of multivariate random variable defined on ; 4) we use to represent loss function between two labelling functions; 5) we use and to represent expected risks on the noisy and correct m.r.v.; 6) we use , and to represent expected discrepancy between two labelling functions under different marginal densities; 7) the ground-truth and pseudo labeling function of the target domain are denoted by and .

### 2.2 New adaptation: from noisy source data to unlabeled target data

We formally define the new adaptation as follows.

###### Definition 1 (Wildly-unsupervised domain adaptation)

Let be a multivariate random variable defined on the space with respect to a probability density , where . Given i.i.d. samples and drawn from and , a wildly-unsupervised domain adaptation aims to train with and to accurately annotate each .

###### Remark 1

In Definition 1, is referred to as a source domain with noisy labeled data, is referred to as a target domain with unlabeled data, and and are two probability measures corresponding to densities and .

### 2.3 Deficiency of current UDA methods

Theoretically, we analyze why current UDA methods cannot directly transfer useful knowledge from noisy source data to unlabeled target data. We first present a theorem to show a relation between and .

###### Theorem 1

###### Remark 2

In Eq. (2), represents the expected risk of the incorrect multivariate random variable. To ensure that we can gain useful knowledge from , we need to avoid . Specifically, we assume: there is a constant such that .

Theorem 1 shows that, in the source domain with noisy data, the expected risk only equals when two cases happen: 1) and and 2) some special combinations (e.g., special , , , and ) to make the second term in Eq. (1) equal zero or to make the second term in Eq. (2) equal . Case 1) means that data in source domain is clean, which is not real in the wild. Case 2) almost never happens, since it is hard to find such special combinations when , , and are unknown. Thus, has an essential difference with . Then, we derive the upper bound of as follows.

###### Theorem 2

For any labelling function , we have

(3) |

###### Remark 3

Similar to Remark 2, to ensure that we can obtain useful knowledge from the pseudo labelling function , we assume: there is a constant such that and , where , .

## 3 Two-step approach or one-step approach?

### 3.1 Two-step approach: A compromise solution

To reduce noise effects in source domain, a straightforward way is to apply a two-step strategy. For example, we first use Co-teaching [11] to train a model on noisy source data, then leverage this trained model to assign pseudo labels for source domain. Via ATDA approach, we can transfer knowledge from the pseudo-labeled source data to the target data.

Nonetheless, the pseudo-labeled source data is still noisy. Let noisy labels in source domain be replaced with pseudo labels after pre-processing. Thus, noise effects will become pseudo-label effects as follows.

(4) |

where and correspond to and in . It is clear that the difference between and is . The first term in may be less than that in due to Co-teaching, but the second term in may be higher than that in since Co-teaching does not consider to minimize it. Thus, it is hard to say whether (i.e., ). This means that, the two-step strategy may not really reduce noise effects.

### 3.2 One-step approach: A noise-eliminating solution

To eliminate noise effects , we aim to select correct data simultaneously from noisy source data and pseudo-labeled target data. In theory, we prove that noise effects will be eliminated if we can select correct data with a high probability. Let represent the probability that incorrect data is selected from noisy source data, and represent the probability that incorrect data is selected from pseudo-labeled target data. The following theorem shows that if and and present a new upper bound of .

###### Theorem 3

###### Remark 4

It should be noted that is actually related to and is related to . In the proof of Theorem 3, we give rigorous definitions of and .

Data drawn from the distribution of can be regarded as a pool that mixes the selected () and unselected () noisy source data. Data drawn from the distribution of can be regarded as a pool that mixes the selected () and unselected () pseudo-labeled target data. Theorem 3 shows that if the selected data has a high probability to be correct one ( and ), then and will approach , meaning that noise effects are eliminated. This motivates us to find a reliable way to select correct data from noisy source data and pseudo-labeled target data.

## 4 Butterfly: Towards robust one-step approach

This section presents a robust one-step approach called Butterfly in detail, and demonstrates how Butterfly minimizes all terms in the right-hand side of Eq. (3).

### 4.1 Principled design of Butterfly

Guided by Theorem 3, a robust approach should check high-correctness data out (meaning and ). This checking process will make and become . Then, we can obtain gradients of , and w.r.t. parameters of and use these gradients to minimize and . Note that cannot be directly minimized since we cannot pinpoint clean data from source domain. However, following [20], we can indirectly minimize via minimizing , as , where the last inequality follows (5). This means that a robust approach guided by Theorem 3 can minimize all terms in the right side of inequality in (3).

To realize this robust approach, we propose a Butterfly paradigm (Algorithm 2), which trains four networks divided into two branches (Figure 2). By using the dual-checking principle, Branch-I checks which mixture data is correct, while Branch-II checks which target data is correct. To ensure that the checked data is highly likely to be correct, we apply the small-loss trick, which is based on memorization effects of deep learning [1]. After cross-propagating these checked data [3], Butterfly can obtain high-quality domain-invariant representations (DIR) and target-specific representations (TSR) simultaneously in an iterative manner. Theoretically, Branch-I minimizes terms ; while Branch-II minimizes terms . This means that Butterfly can minimize all terms in the right side of inequality in (3).

### 4.2 Loss function in Butterfly

Due to , and in Theorem 3, four networks trained by Butterfly share the same loss function but with different inputs.

(7) |

where is the batch size, and represents a network (e.g., and ). is a mini-batch for training a network, where could be a noisy source data or a pseudo-labeled target data, and is the parameters of and is an -by- vector whose elements equal 0 or 1. For two networks in Branch-I, following [20], we also add a regularizer in their loss functions, where and are weights of the first fully-connect layer of and . With this regularizer, and will learn from different features.
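As an illustration of a loss shared by all four networks but driven by a 0/1 selection vector, the following is a minimal sketch in plain Python. The names `cross_entropy`, `selective_loss` and `u`, the choice of cross-entropy as the per-sample loss, and the normalization by the number of selected samples are assumptions made for illustration; the exact form of Eq. (7) is not reproduced here.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of one predicted class distribution against an integer label."""
    return -math.log(max(probs[label], 1e-12))

def selective_loss(batch_probs, labels, u):
    """Average per-sample loss over the samples whose indicator is 1.

    batch_probs: predicted class distributions, one list per sample
    labels:      noisy source labels or target pseudo labels (integers)
    u:           0/1 indicators; 1 keeps a sample, 0 drops it
    Assumption: the loss is normalized by the number of kept samples.
    """
    kept = sum(u)
    if kept == 0:
        return 0.0
    total = sum(ui * cross_entropy(p, y)
                for p, y, ui in zip(batch_probs, labels, u))
    return total / kept

# Only the first sample is kept, so the loss is -log(0.5).
print(selective_loss([[0.5, 0.5], [0.9, 0.1]], [0, 1], [1, 0]))
```

Because the indicator vector is the only input that differs between networks, the same function can serve both branches; the regularizer used in Branch-I is omitted from this sketch.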

### 4.3 Training procedure of Butterfly

The two networks in each branch first check high-correctness data out and then cross-update their parameters using these data. Algorithm 1 demonstrates how and (or and ) check these data out and use them to update on a mini-batch .
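The check-then-cross-update step can be sketched as follows. This is a simplified illustration in the spirit of the small-loss trick used by Co-teaching [11], not a transcription of Algorithm 1: the function names and the `keep_ratio` parameter are hypothetical, and the per-sample losses are assumed to be already computed by each network.

```python
def select_small_loss(losses, keep_ratio):
    """Return 0/1 indicators marking the keep_ratio fraction of samples
    with the smallest losses (treated as likely-correct data)."""
    n_keep = max(1, int(round(keep_ratio * len(losses))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    chosen = set(order[:n_keep])
    return [1 if i in chosen else 0 for i in range(len(losses))]

def cross_check(losses_net_a, losses_net_b, keep_ratio):
    """Each network selects its small-loss samples; its peer trains on them."""
    u_from_a = select_small_loss(losses_net_a, keep_ratio)
    u_from_b = select_small_loss(losses_net_b, keep_ratio)
    # Cross update: network A uses B's selection and vice versa.
    return {"update_a_with": u_from_b, "update_b_with": u_from_a}

result = cross_check([0.1, 2.0, 0.3, 1.5], [1.0, 0.2, 0.1, 3.0], keep_ratio=0.5)
print(result)
```

Swapping the selections between the two networks is what keeps one network's selection errors from being fed straight back into itself.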

Based on loss function defined in Eq. (7), the entire training procedure of Butterfly is shown in Algorithm 2. First, the algorithm initializes training data for two branches ( for Branch-I and for Branch-II), four networks ( and ) and the number of pseudo labels (line ). In the first epoch (), and are the same with because there are no labeled target data. When mini-batch is fetched from (line ), and check high-correctness data out and update their parameters using Algorithm 1 (lines ). Using similar procedures, and can also update their parameters using Algorithm 1 (lines -).

In each epoch, after mini-batch updating, we randomly select unlabeled target data and assign them pseudo labels using and (lines ). Following [20], the Labeling function in Algorithm 2 (line ) assigns pseudo labels for target data, when predictions of and agree and at least one of them is confident about their predictions (probability above or ). Using this function, we can obtain the pseudo-labeled target domain for training Branch-II in the next epoch. Then, we merge and to be for training Branch-I in the next epoch (line ). Finally, we update , and in lines -.
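The agreement-plus-confidence rule for pseudo labeling can be sketched as below. The function name, the single `threshold` argument and the `max_labels` cap are illustrative assumptions; the actual thresholds and the number of pseudo labels follow the settings in Appendix 0.D.

```python
def assign_pseudo_labels(probs_a, probs_b, threshold, max_labels):
    """Return (index, label) pairs for target samples that pass the check.

    A sample is labeled when both networks predict the same class and at
    least one of the two predicted probabilities exceeds `threshold`.
    """
    labeled = []
    for i, (pa, pb) in enumerate(zip(probs_a, probs_b)):
        if len(labeled) >= max_labels:
            break
        ya = max(range(len(pa)), key=lambda k: pa[k])
        yb = max(range(len(pb)), key=lambda k: pb[k])
        if ya == yb and max(pa[ya], pb[yb]) > threshold:
            labeled.append((i, ya))
    return labeled

probs_a = [[0.96, 0.04], [0.60, 0.40], [0.30, 0.70]]
probs_b = [[0.70, 0.30], [0.55, 0.45], [0.80, 0.20]]
print(assign_pseudo_labels(probs_a, probs_b, threshold=0.9, max_labels=10))
```

In this toy input, the first sample is labeled (both networks agree and one is confident), the second is skipped for lack of confidence, and the third is skipped because the networks disagree.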

### 4.4 Relations to Co-teaching and TCL

Although Co-teaching [11] applies the small-loss trick and the cross-update technique to train deep networks against noisy data, it can only deal with one-domain problems rather than cross-domain problems. Recalling definitions of and in (2), Co-teaching can only minimize the first term in or , and ignores the second term in . This deficiency prevents Co-teaching from eliminating noise effects . In contrast, Butterfly can naturally eliminate them. Recently, transferable curriculum learning (TCL) was proposed as a robust method to handle the WUDA task [23]. TCL uses small-loss data to train the domain-adversarial neural network (DANN) [5]. However, TCL can only minimize , while Butterfly can minimize all terms in the right side of inequality in (3).

## 5 Experiments

Simulated Datasets. We verify the effectiveness of our approach on three benchmark (vision and text) datasets, including MNIST, SYN-DIGITS (SYND) and Amazon products reviews (e.g., book, dvd, electronics and kitchen). They are used to construct basic tasks: MNIST→SYND, SYND→MNIST, book→dvd (B→D), book→electronics (B→E), …, and kitchen→electronics (K→E). These tasks are often used for evaluation of UDA methods [5, 20, 21]. Since all source datasets are clean, we need to corrupt source datasets manually by a noise transition matrix [11, 13], which can form WUDA tasks. In detail, we assume that the matrix has two representative structures: 1) Symmetry flipping; 2) Pair flipping. Their precise definitions are presented in Appendix 0.B.

The noise rate is chosen from {0.2, 0.45}. Intuitively, 0.45 means that almost half of the source data have wrong labels, which cannot be learned without additional assumptions; 0.2 means that only 20% of labels are corrupted, which is a low-level noise situation. Note that the pair case is much harder than the symmetry case [11]. For each basic task, we have four kinds of noisy data in source domain: Pair-45% (P45), Pair-20% (P20), Symmetry-45% (S45), Symmetry-20% (S20). Thus, we evaluate the performance of each method using 32 WUDA tasks: 8 digit recognition tasks and 24 human-sentiment tasks. Note that human sentiment analysis is a binary classification problem, so pair flipping is equal to symmetry flipping. Thus, we only have 24 human-sentiment tasks. Results on human-sentiment tasks are reported in Appendix 0.C.

Real-world Datasets. We also verify the efficacy of our approach on the “dense cross-dataset benchmark” including Bing, Caltech256, Imagenet and SUN (BCIS) [24]. In this benchmark, Bing, Caltech256, Imagenet and SUN share 40 common classes. Since the Bing dataset was formed by collecting images retrieved by Bing image search, it contains many noisy samples, with presence of multiple objects in the same image, polysemy and caricaturization [24]. We use Bing as the source domain, and Caltech256, Imagenet and SUN as target domains, which form three real-world WUDA tasks.

Baselines. We compare Butterfly (abbreviated as B-Net) with the following baselines: 1) ATDA: a representative pseudo-label-based UDA method [20]; 2) deep adaptation networks (DAN): a representative IPM-based UDA method [17]; 3) DANN: a representative adversarial-training-based UDA method [5]; 4) Co-teaching+ATDA (Co+ATDA): a two-step method, which is a combination of the state-of-the-art label-noise learning algorithm (Co-teaching) [11] and a UDA method (ATDA) [20]; 5) TCL: an existing robust method for WUDA [23]; 6) Butterfly-Net with target-specific network (B-Net-1T): without considering negative effects (single-checking method). Note that ATDA is the UDA method most closely related to B-Net. Implementation details of each method are reported in Appendix 0.D.

Results on simulated WUDA (including 32 tasks). Table 1 reports the accuracy on the target domain in 8 tasks. As can be seen, on the Symmetry-20% case (the easiest case), most methods work well. ATDA has a satisfactory performance although it does not consider the source-domain noise explicitly. Then, when facing harder cases (i.e., Pair-20% and Pair-45%), ATDA fails to transfer useful knowledge from the source domain to the target domain. On Pair-flip cases, the performance of ATDA is much lower than that of our methods. When facing the hardest cases (M→S with P45 and S45), DANN has higher accuracy than DAN and ATDA. However, when facing the easiest cases (i.e., S→M with P20 and S20), the performance of DANN is worse than that of DAN and ATDA.

Although the two-step method Co+ATDA outperforms ATDA on all tasks, it cannot beat the one-step methods (B-Net-1T and B-Net) in terms of average accuracy. This result is evidence for the claim in Section 3.1. In Table 1, B-Net outperforms B-Net-1T in 7 out of 8 tasks. This reveals that unchecked errors in pseudo-labeled target data indeed reduce the quality of TSR. Note that B-Net does not outperform all methods in all tasks. In the task S→M with P20, Co+ATDA outperforms all methods (slightly higher than B-Net), since pseudo-labeled data in source domain is almost correct. In the task M→S with S45, B-Net-1T outperforms all methods. Specifically, B-Net-1T performs better than B-Net, as pseudo-labeled target data may contain much instance-dependent noise, where small-loss data may not be the correct ones. Thus, the dual-checking process in Branch-II is ineffective in this case.

Figures 3 and 4 show the target-domain accuracy vs. number of epochs among ATDA, Co+ATDA, B-Net-1T and B-Net. Besides, we show the accuracy of ATDA trained with clean labeled data in source domain (ATDA-CS) as a reference point. When the accuracy of one method is close to that of ATDA-CS (red dashed line), this method successfully eliminates noise effects. From our observations, it is clear that B-Net is very close to ATDA-CS in 7 out of 8 tasks (except for the S→M task with P45, in Figure 3-(d)), which is evidence for Theorem 3. Since the P45 case is the hardest task, it is reasonable that B-Net cannot perfectly eliminate noise effects. An interesting phenomenon is that B-Net outperforms ATDA-CS in M→S tasks (Figure 4-(a) and (c)). This means that B-Net can transfer even more useful knowledge (from noisy data in source domain to target domain) than ATDA (from clean data in source domain to target domain).

Tasks | Type | DAN | DANN | ATDA | TCL | Co+ATDA | B-Net-1T | B-Net |
---|---|---|---|---|---|---|---|---|
S→M | P20 | 90.17% | 79.06% | 55.95% | 80.81% | 95.37% | 93.45% | 95.29% |
S→M | P45 | 67.00% | 55.34% | 53.66% | 55.97% | 75.43% | 83.53% | 90.21% |
S→M | S20 | 90.74% | 75.19% | 89.87% | 80.23% | 95.22% | 94.44% | 95.88% |
S→M | S45 | 89.31% | 65.87% | 87.53% | 68.54% | 92.03% | 94.89% | 94.97% |
M→S | P20 | 40.82% | 58.78% | 33.74% | 58.88% | 58.02% | 58.35% | 60.36% |
M→S | P45 | 28.41% | 43.70% | 19.50% | 45.31% | 46.80% | 54.05% | 56.62% |
M→S | S20 | 30.62% | 53.52% | 49.80% | 56.74% | 56.64% | 54.90% | 57.05% |
M→S | S45 | 28.21% | 43.76% | 17.20% | 49.91% | 54.29% | 57.51% | 56.18% |
Average | | 58.16% | 58.01% | 50.91% | 62.05% | 71.73% | 73.89% | 75.82% |
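The aggregate statements about Table 1 can be checked with a few lines of Python. The sketch below copies the B-Net-1T and B-Net columns from the table (rows ordered S→M P20, P45, S20, S45, then M→S P20, P45, S20, S45); the variable names are illustrative only.

```python
# Target-domain accuracy (%) copied from Table 1.
b_net_1t = [93.45, 83.53, 94.44, 94.89, 58.35, 54.05, 54.90, 57.51]
b_net    = [95.29, 90.21, 95.88, 94.97, 60.36, 56.62, 57.05, 56.18]

# Number of tasks on which B-Net beats the single-checking variant.
wins = sum(1 for full, single in zip(b_net, b_net_1t) if full > single)

# Column averages, rounded to two decimals as in the table.
avg_b_net = round(sum(b_net) / len(b_net), 2)
avg_b_net_1t = round(sum(b_net_1t) / len(b_net_1t), 2)

print(wins, avg_b_net, avg_b_net_1t)
```

The only task where B-Net-1T is ahead is M→S with S45, which matches the discussion of instance-dependent noise above.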

Results on real-world WUDA (including 3 tasks). Finally, we show our results on real-world WUDA tasks. Table 2 reports the target-domain accuracy for 3 tasks. B-Net enjoys the best performance on all tasks. It is noted that, in both the Bing→Caltech and Bing→Imagenet tasks, ATDA is slightly worse than B-Net. However, in the Bing→SUN task, ATDA is much worse than B-Net. The reason is that the DIR between Bing and SUN is more affected by noisy data in source domain. This phenomenon is also observed when comparing DANN and TCL. Compared with Co+ATDA, ATDA is slightly better on average. This abnormal phenomenon can be explained using Eq. (4). Recalling Eq. (4), after using Co-teaching to assign pseudo labels for source domain ( in Figure 1), the second term in may increase, which results in , i.e., noise effects actually increase. This phenomenon is evidence that a two-step method may not really reduce the noise effects.

Target | DAN | DANN | ATDA | TCL | Co+ATDA | B-Net-1T | B-Net |
---|---|---|---|---|---|---|---|
Caltech | 77.83% | 78.00% | 80.84% | 79.35% | 79.89% | 81.26% | 81.71% |
Imagenet | 70.29% | 72.16% | 74.89% | 72.53% | 74.73% | 74.81% | 75.00% |
SUN | 24.56% | 26.80% | 26.26% | 28.80% | 26.31% | 30.45% | 30.54% |
Average | 57.56% | 58.99% | 60.66% | 60.23% | 60.31% | 62.17% | 62.42% |

## 6 Conclusions

This paper presents a robust WUDA approach called Butterfly, which can reliably transfer knowledge from noisy source data to target domain. We first reveal why current UDA methods cannot handle noisy source data well. According to our analysis, a natural two-step strategy, i.e., a simple combination of a label-noise algorithm and a UDA method, cannot really eliminate noise effects. Thus, we propose Butterfly to simultaneously eliminate these negative effects and transfer knowledge from the high-correctness source data to the target domain. As the number of training epochs increases, noise effects are gradually eliminated. We compare Butterfly with current UDA methods on simulated and real-world tasks. The results show that Butterfly can handle WUDA tasks well. In the future, we can extend our work to address the few-shot domain adaptation problem and the open-set UDA problem when the source domain contains noisy data.

#### Acknowledgments.

Prof. Masashi Sugiyama was supported by JST CREST JPMJCR1403. Prof. Jie Lu was supported by the Australian Research Council under Discovery Grant DP170101632. Feng Liu would like to thank the financial support from Center for AI, UTS, and Center for AIP, RIKEN.

## References

- [1] D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. Kanwal, T. Maharaj, A. Fischer, A. Courville, and Y. Bengio. A closer look at memorization in deep networks. In ICML, 2017.
- [2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. MLJ, 79(1-2):151–175, 2010.
- [3] Y. Bengio. Evolving culture versus local minima. In Growing Adaptive Machines, pages 109–138. 2014.
- [4] Y. Ganin and V. S. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, pages 1180–1189, 2015.
- [5] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky. Domain-adversarial training of neural networks. JMLR, 17:59:1–59:35, 2016.
- [6] M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang. Scatter component analysis: A unified framework for domain adaptation and domain generalization. TPAMI, 39(7):1414–1430, 2017.
- [7] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073, 2012.
- [8] M. Gong, K. Zhang, T. Liu, D. Tao, and C. Glymour. Domain adaptation with conditional transferable components. In ICML, pages 2839–2848, 2016.
- [9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.
- [10] Y. Guo and M. Xiao. Cross language text classification via subspace co-regularized multi-view learning. In ICML, 2012.
- [11] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. W. Tsang, and M. Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pages 8527–8537, 2018.
- [12] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, pages 1994–2003, 2018.
- [13] L. Jiang, Z. Zhou, T. Leung, L. Li, and F. Li. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pages 2309–2318, 2018.
- [14] J. Lee and M. Raginsky. Minimax statistical learning with wasserstein distances. In NeurIPS, pages 2692–2701, 2018.
- [15] K. Lee, X. He, L. Zhang, and L. Yang. Cleannet: Transfer learning for scalable image classifier training with label noise. In CVPR, pages 5447–5456, 2018.
- [16] T. Liu and D. Tao. Classification with noisy labels by importance reweighting. TPAMI, 38(3):447–461, 2016.
- [17] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.
- [18] E. Malach and S. Shalev-Shwartz. Decoupling “when to update” from “how to update”. In NeurIPS, pages 961–971, 2017.
- [19] S. Motiian, Q. Jones, S. M. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In NeurIPS, pages 6673–6683, 2017.
- [20] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In ICML, pages 2988–2997, 2017.
- [21] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, pages 3723–3732, 2018.
- [22] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. TPAMI, 33(4):754–766, 2011.
- [23] Y. Shu, Z. Cao, M. Long, and J. Wang. Transferable curriculum for weakly-supervised domain adaptation. In AAAI, 2019.
- [24] T. Tommasi and T. Tuytelaars. A testbed for cross-dataset analysis. In ECCV TASK-CV Workshops, pages 18–31, 2014.
- [25] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, pages 4068–4076, 2015.
- [26] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, pages 2962–2971, 2017.
- [27] M. Xiao and Y. Guo. Feature space independent semi-supervised domain adaptation via kernel matching. TPAMI, 37(1):54–66, 2015.
- [28] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In CVPR, pages 2691–2699, 2015.

## Appendix 0.A Review of generation of noisy labels

This section presents a review on two label-noise generation processes.

### 0.A.1 Transition matrix

We assume that there is a clean multivariate random variable () defined on with a probability density , where is a label set with labels. However, samples of () cannot be directly obtained and we can only observe noisy data from the multivariate random variable () defined on with a probability density . is generated by a transition probability , i.e., the flip rate from a clean label to a noisy label . When we generate using , we often assume that , i.e., the class-conditional noise [16]. All these transition probabilities are summarized into a transition matrix , where .

The transition matrix is easily estimated in certain situations [16]. However, in more complex situations, such as clothing1M dataset [28], noisy data is directly generated by selecting data from a pool, which mixes correct data (data with correct labels) and incorrect data (data with incorrect labels). Namely, how the correct label is corrupted to () is unclear.

### 0.A.2 Sample selection

Formally, there is a multivariate random variable defined on with a probability density , where and means “correct” and means “incorrect”. Nevertheless, samples from cannot be obtained and we can only observe from a distribution with the following density.

(8) |

where . The density in Eq. (8) means that we lost the information from . If we uniformly select samples drawn from , the noisy rate of these samples is . It is clear that the multivariate random variable is the clean multivariate random variable defined in Appendix 0.A.1. Then, is used to describe the density of incorrect multivariate random variable . Using and , can be expressed by the following equation.

(9) |

where . Here, we do not assume . To reduce the negative effect that the incorrect samples bring, scholars aim to recover the information of , i.e., to select correct samples from samples drawn from [11, 13, 18].

## Appendix 0.B Transition matrix

Precise definitions of Symmetry flipping and Pair flipping are presented below, where is the noise rate and is the number of labels.

Symmetry flipping: | |||

Pair flipping: |
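For concreteness, the two structures can be built programmatically. The sketch below assumes the standard definitions used in the Co-teaching setting [11]: under symmetry flipping, a label keeps its class with probability one minus the noise rate and moves to each other class with equal probability; under pair flipping, it moves only to the next class (cyclically). The function and parameter names (`symmetry_flipping`, `pair_flipping`, `corrupt_labels`, `noise_rate`, `num_classes`) are illustrative and not taken from the paper.

```python
import random

def symmetry_flipping(noise_rate, num_classes):
    """Row i: keep class i w.p. 1 - noise_rate, else flip uniformly to another class."""
    off = noise_rate / (num_classes - 1)
    return [[1.0 - noise_rate if i == j else off for j in range(num_classes)]
            for i in range(num_classes)]

def pair_flipping(noise_rate, num_classes):
    """Row i: keep class i w.p. 1 - noise_rate, else flip to class (i + 1) mod K."""
    T = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        T[i][i] = 1.0 - noise_rate
        T[i][(i + 1) % num_classes] = noise_rate
    return T

def corrupt_labels(labels, T, seed=0):
    """Sample a noisy label for each clean label from the matching row of T."""
    rng = random.Random(seed)
    classes = list(range(len(T)))
    return [rng.choices(classes, weights=T[y])[0] for y in labels]

T_pair = pair_flipping(0.45, 10)
noisy = corrupt_labels([0, 1, 2, 3] * 5, T_pair)
print(len(noisy), T_pair[9][0])
```

With two classes, the two constructions coincide, which is consistent with the remark in Section 5 that pair flipping equals symmetry flipping for the binary human-sentiment tasks.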

## Appendix 0.C Results on Amazon products reviews

Tables 3 and 4 report the target-domain accuracy of each method for 24 human sentiment analysis tasks. For these tasks, B-Net has the highest average accuracy. It should be noted that the two-step method does not always work, such as for -noise situation. The main reason is that Co-teaching performs poorly when recovering clean information from the noisy data in the source domain. Another observation is that the noise effect is not eliminated as it is in the results on SYND→MNIST. The main reason is that this dataset provides fixed features and we cannot extract better features in the training process. However, in SYND→MNIST tasks, we can gradually obtain better features for each domain and finally eliminate noise effects.

Tasks | DAN | DANN | ATDA | TCL | Co+ATDA | B-Net-1T | B-Net |
---|---|---|---|---|---|---|---|
B→D | 68.28% | 68.08% | 70.31% | 71.40% | 66.70% | 72.42% | 71.84% |
B→E | 63.78% | 63.53% | 72.79% | 65.08% | 68.89% | 73.50% | 75.92% |
B→K | 65.48% | 64.63% | 71.79% | 66.80% | 66.51% | 74.63% | 76.32% |
D→B | 64.63% | 64.52% | 70.25% | 67.33% | 68.04% | 70.69% | 70.56% |
D→E | 65.33% | 65.16% | 69.99% | 66.74% | 67.32% | 72.74% | 73.73% |
D→K | 65.68% | 66.28% | 74.53% | 68.82% | 72.20% | 76.47% | 77.97% |
E→B | 60.41% | 60.15% | 63.89% | 63.13% | 61.08% | 65.52% | 62.22% |
E→D | 62.35% | 61.67% | 62.30% | 62.93% | 59.77% | 64.22% | 63.53% |
E→K | 72.05% | 71.51% | 74.00% | 75.36% | 70.85% | 75.80% | 78.96% |
K→B | 59.94% | 59.40% | 63.53% | 62.77% | 61.22% | 64.16% | 63.36% |
K→D | 61.46% | 61.51% | 64.66% | 64.16% | 64.94% | 67.52% | 66.98% |
K→E | 70.60% | 72.23% | 74.75% | 74.14% | 69.69% | 75.21% | 76.96% |
Average | 65.00% | 64.89% | 69.40% | 67.39% | 66.43% | 71.07% | 71.53% |

Tasks | DAN | DANN | ATDA | TCL | Co+ATDA | B-Net-1T | B-Net |
---|---|---|---|---|---|---|---|
B→D | 52.43% | 52.98% | 53.56% | 54.44% | 54.32% | 54.89% | 56.59% |
B→E | 52.17% | 53.50% | 55.14% | 54.14% | 57.34% | 56.93% | 55.74% |
B→K | 52.89% | 51.84% | 51.14% | 53.32% | 53.28% | 58.38% | 57.00% |
D→B | 53.11% | 53.04% | 54.48% | 53.27% | 55.95% | 51.37% | 55.15% |
D→E | 51.30% | 53.04% | 54.21% | 53.77% | 56.08% | 55.04% | 58.91% |
D→K | 52.15% | 53.17% | 57.99% | 52.45% | 59.94% | 58.43% | 66.20% |
E→B | 51.38% | 51.08% | 52.54% | 52.14% | 53.30% | 50.53% | 54.93% |
E→D | 52.83% | 51.24% | 49.02% | 52.57% | 49.62% | 50.11% | 52.88% |
E→K | 54.21% | 53.58% | 51.66% | 55.04% | 52.10% | 48.62% | 56.12% |
K→B | 50.44% | 51.77% | 51.96% | 51.50% | 52.59% | 49.88% | 51.39% |
K→D | 52.20% | 51.45% | 52.86% | 53.19% | 54.52% | 52.91% | 53.53% |
K→E | 54.72% | 53.33% | 52.11% | 53.46% | 52.62% | 53.11% | 53.71% |
Average | 52.49% | 52.50% | 53.65% | 53.06% | 54.31% | 53.35% | 56.01% |

## Appendix 0.D Experimental settings

Network structure and optimizer. We implement all methods in Python 3.6 on an NVIDIA P100 GPU. We use MomentumSGD for optimization in digit and real-world tasks, and set the momentum as . We use Adagrad for optimization in human-sentiment tasks because of sparsity of review data [20]. , , and are 6-layer CNN ( convolutional layers and fully-connected layers) for digit tasks; and are 3-layer neural networks ( fully-connected layers) for human-sentiment tasks; and are -layer neural networks ( fully-connected layers) for real-world tasks. ReLU is used as the activation function of these networks. Besides, dropout and batch normalization are also used. As deep networks are highly nonconvex, even with the same network and optimization method, different initializations can lead to different local optima. Thus, following [11, 18], we also take four networks with the same architecture but different initializations as four classifiers.

Experimental setup. For all WUDA tasks, is set to , is set to . Learning rate is set to for simulated tasks and for real-world tasks, is set to for simulated tasks and for real-world tasks. Confidence level of labelling function in line 8 of Algorithm 2 is set to for digit tasks, and for human-sentiment tasks and for real-world tasks. is set to 0.4 for digit tasks, for human-sentiment tasks and for real-world tasks. is set to for digit tasks, for human-sentiment tasks and for real-world tasks. is set to for digit tasks and for human-sentiment and real-world tasks. Batch size is set to for digit and real-world tasks and for human-sentiment tasks. Penalty parameter is set to 0.01 for digit and real-world tasks and 0.001 for human-sentiment tasks.

To compare all methods fairly, we give them the same network structure. Namely, ATDA, DAN, DANN, TCL, B-Net-1T and B-Net adopt the same network structure for each dataset. Note that DANN and TCL use the same structure for their discriminator networks. All experiments are repeated ten times, and we report the average and standard deviation of the accuracy over the ten runs.