Multinomial Adversarial Networks for Multi-Domain Text Classification
Abstract
Many text classification tasks are known to be highly domain-dependent. Unfortunately, the availability of training data can vary drastically across domains. Worse still, for some domains there may not be any annotated data at all. In this work, we propose a multinomial adversarial network (MAN) to tackle this real-world multi-domain text classification (MDTC) problem. We provide theoretical justifications for the MAN framework, proving that different instances of MANs are essentially minimizers of various f-divergence metrics Ali and Silvey (1966) among multiple probability distributions. MANs are thus a theoretically sound generalization of traditional adversarial networks that discriminate over two distributions. More specifically, for the MDTC task, MAN learns features that are invariant across multiple domains by resorting to its ability to reduce the divergence among the feature distributions of each domain. We present experimental results showing that MANs significantly outperform the prior art on the MDTC task. We also show that MANs achieve state-of-the-art performance for domains with no labeled data.
The source code of MAN can be found at https://github.com/ccsasuke/man
1 Introduction
Text classification is one of the most fundamental tasks in Natural Language Processing, and has found its way into a wide spectrum of NLP applications, ranging from email spam detection and social media analytics to sentiment analysis and data mining. Over the past couple of decades, supervised statistical learning methods have become the dominant approach for text classification (e.g. \newcite{mccallum1998comparison,D14-1181,P15-1162}). Unfortunately, many text classification tasks are highly domain-dependent, in that a text classifier trained using labeled data from one domain is likely to perform poorly on another. In the task of sentiment classification, for example, the phrase “runs fast” is usually associated with positive sentiment in the sports domain; not so when a user is reviewing the battery of an electronic device. In real applications, therefore, an adequate amount of training data from each domain of interest is typically required, and this is expensive to obtain.
Two major lines of work attempt to tackle this challenge: domain adaptation Blitzer et al. (2007) and multi-domain text classification (MDTC) Li and Zong (2008). In domain adaptation, the assumption is that there is some domain with abundant training data (the source domain), and the goal is to utilize knowledge learned from the source domain to help perform classification on another lower-resourced target domain (see §5 for other variants of domain adaptation). The focus of this work, MDTC, instead simulates an arguably more realistic scenario, where labeled data may exist for multiple domains, but in insufficient amounts to train an effective classifier for one or more of the domains. Worse still, some domains may have no labeled data at all. The objective of MDTC is to leverage all the available resources in order to improve the system performance over all domains simultaneously.
One state-of-the-art system for MDTC, the CMSC of \newcite{wu2015}, combines a classifier that is shared across all domains (for learning domain-invariant knowledge) with a set of classifiers, one per domain, each of which captures domain-specific text classification knowledge. This paradigm is sometimes known as the Shared-Private model Bousmalis et al. (2016). CMSC, however, lacks an explicit mechanism to ensure that the shared classifier captures only domain-independent knowledge: the shared classifier may well also acquire some domain-specific features that are useful for only a subset of the domains. We hypothesize that better performance could be obtained if this constraint were explicitly enforced.
In this paper, we thus propose Multinomial Adversarial Networks (henceforth, MANs) for the task of multi-domain text classification. Standard adversarial networks Goodfellow et al. (2014) serve as a tool for minimizing the divergence between two distributions Nowozin et al. (2016); MANs, in contrast, represent a family of theoretically sound adversarial networks that leverage a multinomial discriminator to directly minimize the divergence among multiple probability distributions. And just as binomial adversarial networks have been applied to numerous tasks (e.g. image generation Goodfellow et al. (2014), domain adaptation Ganin et al. (2016), cross-lingual sentiment analysis Chen et al. (2016)), we anticipate that MANs will serve as a versatile machine learning framework with applications beyond the MDTC task studied in this work.
We introduce the MAN architecture in §2 and prove in §3 that it directly minimizes the (generalized) f-divergence among multiple distributions so that they are indistinguishable upon successful training. Specifically for MDTC, MAN is used to overcome the aforementioned limitation of the prior art, in which domain-specific features may sneak into the shared model. This is done by relying on MAN’s power of minimizing the divergence among the feature distributions of each domain. The high-level idea is that MAN will make the extracted feature distributions of each domain indistinguishable from one another, thus learning general features that are invariant across domains.
We then validate the effectiveness of MAN in experiments on two MDTC data sets. We find first that MAN significantly outperforms the state-of-the-art CMSC method Wu and Huang (2015) on the widely used multi-domain Amazon review dataset, and does so without relying on external resources such as sentiment lexica (§4.1). When applied to the FDU-MTL dataset (§4.3), we obtain similar results: MAN achieves substantially higher accuracy than the previous top-performing method, ASP-MTL Liu et al. (2017). ASP-MTL represents the first empirical attempt to use a multinomial adversarial network, proposed for a multi-task learning setting, but is more restricted and can be viewed as a special case of MAN. In addition, we for the first time provide theoretical guarantees for MANs (§3) that were absent in ASP-MTL. Finally, while many MDTC methods such as CMSC require labeled data for each domain, MANs can be applied in cases where no labeled data exists for a subset of domains. To evaluate MAN in this semi-supervised setting, we compare MAN to a method that can accommodate unlabeled data for (only) one domain Zhao et al. (2017), and show that MAN achieves performance comparable to the state of the art (§4.2).
2 Model
In this paper, we strive to tackle the text classification problem in a real-world setting in which texts come from a variety of domains, each with a varying amount of labeled data. Specifically, assume we have $N$ domains in total: $N_1$ labeled domains (denoted $\Delta_L$) for which there is some labeled data, and $N_2$ unlabeled domains ($\Delta_U$) for which no annotated training instances are available. Denote $\Delta = \Delta_L \cup \Delta_U$ as the collection of all domains, with $N = N_1 + N_2$ being the total number of domains we are faced with. The goal of this work, and of MDTC in general, is to improve the overall classification performance across all domains, measured in this paper as the average classification accuracy across the $N$ domains in $\Delta$.
2.1 Model Architecture
As shown in Figure 1, the Multinomial Adversarial Network (MAN) adopts the Shared-Private paradigm of \newcite{NIPS2016_6254} and consists of four components: a shared feature extractor $\mathcal{F}_s$, a domain feature extractor $\mathcal{F}_d^i$ for each labeled domain $\Delta_i \in \Delta_L$, a text classifier $\mathcal{C}$, and finally a domain discriminator $\mathcal{D}$. The main idea of MAN is to explicitly model both the domain-invariant features that are beneficial to the main classification task across all domains (shared features, extracted by $\mathcal{F}_s$) and the domain-specific features that mainly contribute to the classification in their own domain (domain features, extracted by $\mathcal{F}_d^i$). Here, the adversarial domain discriminator $\mathcal{D}$ has a multinomial output: it takes a shared feature vector and predicts the likelihood of that sample coming from each domain. As seen in the training flow of $\mathcal{F}_s$ in Figure 1 (green arrows), $\mathcal{F}_s$ aims to confuse $\mathcal{D}$ by minimizing a domain loss that is anticorrelated to $\mathcal{D}$'s own objective (detailed in §2.2), so that $\mathcal{D}$ cannot predict the domain of a sample given its shared features. The intuition is that if even a strong discriminator cannot tell the domain of a sample from the extracted features, those features are essentially domain invariant. With $\mathcal{F}_s$ thus enforced to learn domain-invariant features, when trained jointly via backpropagation, the domain feature extractors $\mathcal{F}_d^i$ will each learn domain-specific features beneficial within their own domain.
The architecture of each component is relatively flexible, and can be chosen by practitioners to suit their particular classification tasks. For instance, the feature extractors can adopt the form of Convolutional Neural Nets (CNN), Recurrent Neural Nets (RNN), or a Multi-Layer Perceptron (MLP), depending on the input data (see §4). The input of MAN will also depend on the feature extractor choice. The output of a (shared/domain) feature extractor is a fixed-length vector, which is considered the (shared/domain) hidden features of some given input text. On the other hand, the outputs of $\mathcal{C}$ and $\mathcal{D}$ are label probabilities for class and domain prediction, respectively. For example, both $\mathcal{C}$ and $\mathcal{D}$ can be MLPs with a softmax layer on top. In §3, we provide alternative architectures for $\mathcal{D}$ and their mathematical implications. We now present detailed descriptions of MAN training in §2.2, as well as its theoretical grounds in §3.
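To make the four components concrete, the following is a minimal numpy sketch of one forward pass through a MAN with MLP components. The dimensions, hidden sizes, and initialization are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d_in, d_hidden, d_out):
    """Parameters of a single-hidden-layer MLP (illustrative initialization)."""
    return (rng.normal(0, 0.1, (d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(0, 0.1, (d_hidden, d_out)), np.zeros(d_out))

def mlp(params, x):
    """ReLU MLP: a stand-in for any of the extractors/classifiers."""
    W1, b1, W2, b2 = params
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

N, d_in, d_shared, d_domain, n_classes = 3, 50, 16, 8, 2
F_s = init_mlp(d_in, 32, d_shared)                      # shared feature extractor
F_d = [init_mlp(d_in, 32, d_domain) for _ in range(N)]  # one extractor per labeled domain
C = init_mlp(d_shared + d_domain, 32, n_classes)        # text classifier (concat input)
D = init_mlp(d_shared, 32, N)                           # multinomial domain discriminator

x = rng.normal(size=(4, d_in))            # a toy batch from domain 0
fs, fd = mlp(F_s, x), mlp(F_d[0], x)
class_probs = softmax(mlp(C, np.concatenate([fs, fd], axis=1)))
domain_probs = softmax(mlp(D, fs))        # D sees only the shared features
```

Note that the classifier consumes the concatenation of shared and domain features, while the discriminator's multinomial (softmax) output ranges over all $N$ domains.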
2.2 Training
Denote the annotated corpus in a labeled domain $\Delta_i \in \Delta_L$ as $\mathbb{X}_i$; $(x, y) \sim \mathbb{X}_i$ is a sample drawn from the labeled data in domain $\Delta_i$, where $x$ is the input and $y$ is the task label. On the other hand, for any domain $\Delta_i \in \Delta$, denote the unlabeled corpus as $\mathbb{U}_i$. Note that for a labeled domain, one can use a separate unlabeled corpus, simply reuse the labeled data, or use both.
In Figure 1, the arrows illustrate the training flows of the various components. Due to the adversarial nature of the domain discriminator $\mathcal{D}$, it is trained with a separate optimizer (red arrows), while the rest of the networks are updated with the main optimizer (green arrows). $\mathcal{C}$ is only trained on labeled domains, and it takes as input the concatenation of the shared and domain feature vectors. At test time for unlabeled domains, which have no $\mathcal{F}_d$, the domain features in $\mathcal{C}$'s input are set to the $\mathbf{0}$ vector. In contrast, $\mathcal{D}$ takes only the shared features as input, for both labeled and unlabeled domains. The MAN training is described in Algorithm 1.
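The alternating schedule behind this two-optimizer setup can be outlined as follows. This is a structural sketch only: the two update steps are stubbed out, and the number of discriminator updates per iteration (`k`) is a hyperparameter assumed here rather than taken from Algorithm 1 verbatim.

```python
def train_man(n_iters, k, update_d, update_main):
    """Alternating adversarial training: k discriminator steps, then one main step."""
    for _ in range(n_iters):
        for _ in range(k):
            update_d()      # separate optimizer: minimize D's loss w.r.t. D only (red arrows)
        update_main()       # main optimizer: update F_s, all F_d, and C jointly (green arrows)

log = []
train_man(2, 3, update_d=lambda: log.append("D"), update_main=lambda: log.append("main"))
# log now records three D-steps before every main step
```

The key property is that the discriminator is refreshed against the current shared features before each update of the rest of the network.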
In Algorithm 1, $\mathcal{L}_C$ and $\mathcal{L}_D$ are the loss functions of the text classifier $\mathcal{C}$ and the domain discriminator $\mathcal{D}$, respectively. As mentioned in §2.1, $\mathcal{C}$ has a softmax layer on top for classification. We hence adopt the canonical negative log-likelihood (NLL) loss:

$\mathcal{L}_C(\hat{y}, y) = -\log P(\hat{y} = y)$   (1)
where $y$ is the true label and $\hat{y}$ is the prediction. For $\mathcal{D}$, we consider two variants of MAN. The first uses the same NLL loss as $\mathcal{C}$, which suits the classification task; the other uses the Least-Square (L2) loss, which was shown to alleviate the vanishing-gradient problem that the NLL loss suffers from in the adversarial setting Mao et al. (2017):

$\mathcal{L}_D^{NLL}(\hat{d}, d) = -\log P(\hat{d} = d)$   (2)

$\mathcal{L}_D^{L2}(\hat{d}, d) = \sum_{i=1}^{N} \left(\hat{d}_i - \mathbb{1}\{i = d\}\right)^2$   (3)

where $d$ is the domain index of some sample and $\hat{d}$ is the prediction. Without loss of generality, we normalize $\hat{d}$ so that $\hat{d}_i \ge 0$ and $\sum_{i=1}^{N} \hat{d}_i = 1$.
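As a quick sanity check, the two discriminator losses (2) and (3) can be computed directly from a normalized prediction vector. This small numpy sketch is illustrative only:

```python
import numpy as np

def nll_loss(d_hat, d):
    """Eq. (2): negative log-likelihood of the true domain index d."""
    return -np.log(d_hat[d])

def l2_loss(d_hat, d):
    """Eq. (3): squared distance between the prediction and the one-hot domain label."""
    onehot = np.zeros_like(d_hat)
    onehot[d] = 1.0
    return np.sum((d_hat - onehot) ** 2)

d_hat = np.array([0.7, 0.2, 0.1])   # normalized discriminator output, N = 3
print(nll_loss(d_hat, 0))           # -log 0.7, about 0.357
print(l2_loss(d_hat, 0))            # 0.09 + 0.04 + 0.01 = 0.14
```

Both losses decrease as the prediction concentrates on the true domain; they differ in how they penalize confident mistakes.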
Therefore, the objectives of $\mathcal{C}$ and $\mathcal{D}$ that we are minimizing are:

$J_{\mathcal{C}} = \sum_{i=1}^{N_1} \mathbb{E}_{(x,y)\sim\mathbb{X}_i}\left[\mathcal{L}_C\!\left(\mathcal{C}(\mathcal{F}_s(x), \mathcal{F}_d^i(x)),\ y\right)\right]$   (4)

$J_{\mathcal{D}} = \sum_{i=1}^{N} \mathbb{E}_{x\sim\mathbb{U}_i}\left[\mathcal{L}_D\!\left(\mathcal{D}(\mathcal{F}_s(x)),\ i\right)\right]$   (5)
For the feature extractors, the training of the domain feature extractors is straightforward, as their sole objective is to help $\mathcal{C}$ perform better within their own domain. Hence, $J_{\mathcal{F}_d^i} = J_{\mathcal{C}}$ for any domain $\Delta_i \in \Delta_L$. Finally, the shared feature extractor $\mathcal{F}_s$ has two objectives: to help $\mathcal{C}$ achieve higher accuracy, and to make the feature distribution invariant across all domains. It thus leads to the following bipartite loss:

$J_{\mathcal{F}_s} = J_{\mathcal{C}} + \lambda \cdot J_{\mathcal{F}_s}^{Domain}$

where $\lambda$ is a hyperparameter balancing the two parts, and $J_{\mathcal{F}_s}^{Domain}$ is the domain loss of $\mathcal{F}_s$, anticorrelated to $J_{\mathcal{D}}$:

$J_{\mathcal{F}_s}^{Domain} = -J_{\mathcal{D}}$   (6)

$J_{\mathcal{F}_s}^{Domain} = \sum_{i=1}^{N} \mathbb{E}_{x\sim\mathbb{U}_i}\left[\sum_{j=1}^{N}\left(\mathcal{D}\!\left(\mathcal{F}_s(x)\right)_j - \frac{1}{N}\right)^2\right]$   (7)
If $\mathcal{D}$ adopts the NLL loss, the domain loss (6) is simply the negative of $\mathcal{D}$'s objective. For the L2 loss, (7) intuitively translates to pushing $\mathcal{D}$ to make random predictions. See §3 for theoretical justifications.
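The effect of the two domain-loss variants on the shared extractor can be illustrated numerically per sample: the NLL variant (6) rewards shared features on which the discriminator misjudges the true domain, while the L2 variant (7) is minimized exactly when the discriminator's output is uniform. An illustrative numpy sketch:

```python
import numpy as np

def domain_loss_nll(d_hat, d):
    """Eq. (6), per sample: the negative of the discriminator's NLL loss."""
    return np.log(d_hat[d])

def domain_loss_l2(d_hat):
    """Eq. (7), per sample: squared distance from the output to the uniform vector."""
    N = len(d_hat)
    return np.sum((d_hat - 1.0 / N) ** 2)

uniform = np.ones(3) / 3
confident = np.array([0.9, 0.05, 0.05])

# Both variants prefer shared features on which the domain cannot be identified:
assert domain_loss_l2(uniform) == 0.0
assert domain_loss_l2(confident) > 0.0
assert domain_loss_nll(uniform, 0) < domain_loss_nll(confident, 0)
```

In both cases the minimizing behavior for the shared extractor is the same: produce features that leave the discriminator maximally uncertain.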
3 Theories of Multinomial Adversarial Networks
Binomial adversarial nets are known to have theoretical connections to the minimization of various f-divergences between two distributions Nowozin et al. (2016). For adversarial training among multiple distributions, however, although a similar idea has been empirically explored Liu et al. (2017), no theoretical justification has been provided to the best of our knowledge.
In this section, we present a theoretical analysis showing the validity of MAN. In particular, we show that MAN’s objective is equivalent to minimizing the total f-divergence between each of the shared feature distributions of the domains and the centroid of the distributions. The choice of loss function determines which specific f-divergence is minimized. Furthermore, with adequate model capacity, MAN achieves its optimum for either loss function if and only if all shared feature distributions are identical, hence learning an invariant feature space across all domains.
First consider the distribution of the shared features $f = \mathcal{F}_s(x)$ for instances in each domain $\Delta_i$:

$P_i(f) \triangleq P\!\left(f = \mathcal{F}_s(x) \mid x \in \Delta_i\right)$   (8)
Combining (5) with the two loss functions (2) and (3), the objective of $\mathcal{D}$ can be written as:

$J_{\mathcal{D}}^{NLL} = -\sum_{i=1}^{N} \mathbb{E}_{f\sim P_i}\left[\log \mathcal{D}_i(f)\right]$   (9)

$J_{\mathcal{D}}^{L2} = \sum_{i=1}^{N} \mathbb{E}_{f\sim P_i}\left[\sum_{j=1}^{N}\left(\mathcal{D}_j(f) - \mathbb{1}\{i = j\}\right)^2\right]$   (10)

where $\mathcal{D}_i(f)$ is the $i$-th dimension of $\mathcal{D}$'s (normalized) output vector, which conceptually corresponds to the probability of $\mathcal{D}$ predicting that $f$ comes from domain $\Delta_i$.
We first derive the optimal $\mathcal{D}$ for any fixed $\mathcal{F}_s$.
Lemma 1.
For any fixed $\mathcal{F}_s$, with either the NLL or the L2 loss, the optimum domain discriminator $\mathcal{D}^*$ is:

$\mathcal{D}_i^*(f) = \dfrac{P_i(f)}{\sum_{j=1}^{N} P_j(f)}$   (11)
The proof involves an application of Lagrange multipliers to solve for the minimum of $J_{\mathcal{D}}$; the details can be found in the Appendix. We then have the following main theorems for the domain loss of $\mathcal{F}_s$:
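Lemma 1 can be checked numerically at a single feature point $f$: with the NLL loss, the pointwise objective is the cross-entropy $-\sum_i P_i(f)\log \mathcal{D}_i(f)$ under the constraint that the output lies on the probability simplex, and no point on the simplex beats the claimed optimum. This sketch is an illustration, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])          # the values P_i(f) at one fixed point f

def pointwise_objective(d):
    """Integrand of J_D (NLL) at f, as a function of the discriminator output d."""
    return -np.sum(p * np.log(d))

d_star = p / p.sum()                    # Lemma 1's claimed optimum, eq. (11)
best = pointwise_objective(d_star)

# Random search over the probability simplex never improves on d_star:
for _ in range(1000):
    d = rng.dirichlet(np.ones(3))
    assert pointwise_objective(d) >= best - 1e-9
```

Since `p` here already sums to one, the optimum value coincides with the entropy of `p`, as expected for a cross-entropy minimized at its own distribution.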
Theorem 1.
Let $\bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i$ be the centroid distribution. When $\mathcal{D}$ is trained to its optimality, if $\mathcal{D}$ adopts the NLL loss:

$J_{\mathcal{F}_s}^{Domain} = -J_{\mathcal{D}} = N \cdot \mathrm{JSD}(P_1, P_2, \dots, P_N) - N\log N$

where $\mathrm{JSD}(\cdot)$ is the generalized Jensen-Shannon divergence Lin (1991) among multiple distributions, defined as the average Kullback-Leibler divergence of each $P_i$ to the centroid $\bar{P}$ Aslam and Pavlu (2007).
Theorem 2.
If $\mathcal{D}$ uses the L2 loss:

$J_{\mathcal{F}_s}^{Domain} = \dfrac{1}{N}\sum_{i=1}^{N} \chi^2_{N}\!\left(\bar{P} \,\|\, P_i\right)$

where $\chi^2_{N}(\cdot\|\cdot)$ is the Neyman $\chi^2$ divergence Nielsen and Nock (2014). The proofs of both theorems can be found in the Appendix.
Consequently, by the non-negativity and joint convexity of the f-divergence Csiszar and Korner (1982), we have:
Corollary 1.
The optimum of $J_{\mathcal{F}_s}^{Domain}$ is $-N\log N$ when using the NLL loss, and $0$ for the L2 loss. In either case, the optimum is achieved if and only if $P_1 = P_2 = \dots = P_N$.
Therefore, the loss of $\mathcal{F}_s$ can be interpreted as simultaneously minimizing the classification loss and the divergence among the feature distributions of all domains. Upon successful training, $\mathcal{F}_s$ thus learns a shared feature mapping that is invariant across domains while remaining beneficial to the main classification task.
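Corollary 1 can likewise be verified numerically on discrete distributions: plugging the optimal discriminator into the NLL domain loss gives $N\cdot\mathrm{JSD} - N\log N$, which attains its minimum $-N\log N$ exactly when all the $P_i$ coincide. An illustrative numpy check:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return np.sum(p * np.log(p / q))

def generalized_jsd(ps):
    """Average KL divergence of each P_i to the centroid (Lin, 1991)."""
    centroid = np.mean(ps, axis=0)
    return np.mean([kl(p, centroid) for p in ps])

def domain_loss_at_optimum(ps):
    """Theorem 1: N * JSD(P_1..P_N) - N log N at the optimal discriminator."""
    N = len(ps)
    return N * generalized_jsd(ps) - N * np.log(N)

identical = [np.array([0.5, 0.5])] * 3
distinct = [np.array([0.9, 0.1]), np.array([0.5, 0.5]), np.array([0.1, 0.9])]

assert abs(domain_loss_at_optimum(identical) - (-3 * np.log(3))) < 1e-12
assert domain_loss_at_optimum(distinct) > -3 * np.log(3)
```

The generalized JSD is zero only for identical distributions, so the domain loss acts as a measured penalty on any divergence among the feature distributions.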
4 Experiments
4.1 MultiDomain Text Classification
In this experiment, we compare MAN to state-of-the-art MDTC systems on the multi-domain Amazon review dataset Blitzer et al. (2007), one of the most widely used MDTC datasets. Note that this dataset was already preprocessed into a bag of features (unigrams and bigrams), losing all word order information. This prohibits the use of CNNs or RNNs as feature extractors, limiting the potential performance of the system. Nonetheless, we adopt the same dataset for fair comparison and employ an MLP as our feature extractor. In particular, we take the 5000 most frequent features and represent each review as a 5000-d feature vector, where feature values are the raw counts of the features. Our MLP feature extractor thus has an input size of 5000 in order to process the reviews.
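For concreteness, this bag-of-features representation can be built with a few lines of plain Python. The whitespace tokenization and tiny vocabulary size here are illustrative stand-ins for the dataset's actual preprocessing:

```python
from collections import Counter

def build_vocab(docs, k):
    """Collect unigram and bigram counts and keep the k most frequent features."""
    counts = Counter()
    for doc in docs:
        toks = doc.split()
        counts.update(toks)
        counts.update(" ".join(bg) for bg in zip(toks, toks[1:]))
    return [feat for feat, _ in counts.most_common(k)]

def vectorize(doc, vocab):
    """Raw-count feature vector: the fixed-size input to the MLP extractor."""
    toks = doc.split()
    counts = Counter(toks)
    counts.update(" ".join(bg) for bg in zip(toks, toks[1:]))
    return [counts[feat] for feat in vocab]

docs = ["runs fast and smooth", "battery runs fast", "runs fast runs fast"]
vocab = build_vocab(docs, 5)        # the paper keeps the 5000 most frequent features
vec = vectorize("runs fast", vocab)
```

Each review becomes a fixed-length count vector over the shared vocabulary, which is exactly the kind of input an MLP can consume.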
The Amazon dataset contains 2000 samples for each of four domains: book, DVD, electronics, and kitchen, with binary labels (positive, negative). Following \newcite{wu2015}, we conduct 5-way cross validation: three of the five folds are treated as the training set, one serves as the validation set, and the remaining one is the test set. The 5-fold average test accuracy is reported.
Table 1 shows the main results. Three types of models are shown: Domain-Specific Models Only, where only in-domain models are trained (for our models, this means $\mathcal{F}_s$ is disabled; similarly, for Shared Model Only, no $\mathcal{F}_d$ is used); Shared Model Only, where a single model is trained with all data; and Shared-Private Models, a combination of the previous two. Within each category, various architectures are examined, such as Least Square (LS), SVM, and Logistic Regression (LR). As explained before, we use MLPs as the feature extractors for all our models (shown in bold). Among our models, the ones with the MAN prefix use adversarial training, and MAN-L2 and MAN-NLL denote the MANs with the L2 loss and the NLL loss, respectively.
From Table 1, we can see that by adopting modern deep neural networks, our methods achieve superior performance within the first two model categories even without adversarial training. This is corroborated by the fact that our SP-MLP model performs comparably to CMSC, while the latter relies on external resources such as sentiment lexica. Moreover, when our multinomial adversarial nets are introduced, further improvement is observed: with either loss function, MAN outperforms all Shared-Private baseline systems on each domain, and achieves statistically significantly higher overall performance. For our MAN-SP models, we report the mean accuracy as well as the standard error over five runs, to illustrate the performance variance and to conduct significance tests. MAN's performance is relatively stable, and consistently outperforms CMSC.
4.2 Experiments for Unlabeled Domains
As CMSC requires labeled data for each domain, its experiments were naturally designed this way. In reality, however, many domains may not have any annotated corpora available. It is therefore also important to examine the performance of an MDTC system on these unlabeled domains. Fortunately, as described before, MAN's adversarial training utilizes only unlabeled data from each domain to learn the domain-invariant features, and can thus be applied to unlabeled domains as well. During testing, only the shared feature vector is fed into $\mathcal{C}$, while the domain feature vector is set to $\mathbf{0}$.
Target Domain    | Book  | DVD   | Elec. | Kit.  | Avg.
MLP              | 76.55 | 75.88 | 84.60 | 85.45 | 80.62
mSDA^1           | 76.98 | 78.61 | 81.98 | 84.26 | 80.46
DANN^2           | 77.89 | 78.86 | 84.91 | 86.39 | 82.01
MDAN (H-MAX)^3   | 78.45 | 77.97 | 84.83 | 85.80 | 81.76
MDAN (S-MAX)^3   | 78.63 | 80.65 | 85.34 | 86.26 | 82.72
MAN-L2-SP-MLP    | 78.45 | 81.57 | 83.37 | 85.57 | 82.24
MAN-NLL-SP-MLP   | 77.78 | 82.74 | 83.75 | 86.41 | 82.67

^1 \newcite{ICML2012Chen_416}  ^2 \newcite{ganin2016domain}  ^3 \newcite{DBLP:journals/corr/ZhaoZWCMG17}
To validate MAN's effectiveness, we compare it to state-of-the-art multi-source domain adaptation (MS-DA) methods (see §5). Compared to standard domain adaptation methods with one source and one target domain, MS-DA allows adaptation from multiple source domains to a single target domain. Analogously, MDTC can be viewed as multi-source multi-target domain adaptation, which is more general when multiple target domains exist: with multiple target domains, MS-DA must treat each one as an independent task, which is more expensive and cannot utilize the unlabeled data in the other target domains.
In this work, we compare MAN with one recent MS-DA method, MDAN Zhao et al. (2017). Their experiments use only one target domain, to suit their approach, and we follow this setting for fair comparison. It is worth noting, however, that MAN is designed for the MDTC setting and can handle multiple target domains at the same time, which can potentially improve performance by taking advantage of more unlabeled data from multiple target domains during adversarial training. We adopt the same setting as \newcite{DBLP:journals/corr/ZhaoZWCMG17}, based on the same multi-domain Amazon review dataset: each of the four domains in the dataset is treated as the target domain in four separate experiments, while the remaining three are used as source domains.
books  elec.  dvd  kitchen  apparel  camera  health  music  toys  video  baby  magaz.  softw.  sports  IMDb  MR  Avg.

Domain-Specific Models Only
BiLSTM  81.0  78.5  80.5  81.2  86.0  86.0  78.7  77.2  84.7  83.7  83.5  91.5  85.7  84.0  85.0  74.7  82.6
CNN  85.3  87.8  76.3  84.5  86.3  89.0  87.5  81.5  87.0  82.3  82.5  86.8  87.5  85.3  83.3  75.5  84.3

Shared Model Only
FS-MTL  82.5  85.7  83.5  86.0  84.5  86.5  88.0  81.2  84.5  83.7  88.0  92.5  86.2  85.5  82.5  74.7  84.7
MAN-L2-CNN  88.3  88.3  87.8  88.5  85.3  90.5  90.8  85.3  89.5  89.0  89.5  91.3  88.3  89.5  88.5  73.8  87.7
MAN-NLL-CNN  88.0  87.8  87.3  88.5  86.3  90.8  89.8  84.8  89.3  89.3  87.8  91.8  90.0  90.3  87.3  73.5  87.6

Shared-Private Models
ASP-MTL  84.0  86.8  85.5  86.2  87.0  89.2  88.2  82.5  88.0  84.5  88.2  92.2  87.2  85.7  85.5  76.7  86.1
MAN-L2-SP-CNN  87.6*  87.4  88.1*  89.8*  87.6  91.4*  89.8*  85.9*  90.0*  89.5*  90.0  92.5  90.4*  89.0*  86.6  76.1  88.2*
MAN-NLL-SP-CNN  86.8*  88.8  88.6*  89.9*  87.6  90.7  89.4  85.5*  90.4*  89.6*  90.2  92.9  90.9*  89.0*  87.0*  76.7  88.4*
In Table 2, the target domain is shown on top, and the test set accuracy is reported for various systems. MAN outperforms several baseline systems, such as an MLP trained on the source domains, as well as single-source domain adaptation methods such as mSDA Chen et al. (2012) and DANN Ganin et al. (2016), for which the training data from the multiple source domains are combined and viewed as a single domain. Finally, when compared to MDAN, MAN and MDAN each achieve higher accuracy on two of the four target domains, and the average accuracy of MAN is similar to that of MDAN. MAN thus achieves competitive performance on domains without an annotated corpus. Moreover, unlike MS-DA methods, MAN can handle multiple target domains at the same time.
4.3 Experiments on the MTL Dataset
To make fair comparisons, the previous experiments follow the standard settings in the literature, using the widely adopted Amazon review dataset. However, this dataset has a few limitations. First, it has only four domains. In addition, the reviews are already tokenized and converted to a bag of features consisting of unigrams and bigrams; raw review texts are hence not available, making it impossible to use certain modern neural architectures such as CNNs and RNNs. To provide more insight into how well MAN works with other feature extractor architectures, we present a third set of experiments on the FDU-MTL dataset Liu et al. (2017). The dataset was created as a multi-task learning dataset with 16 tasks, each of which is essentially a different domain of reviews: 14 Amazon domains (books, electronics, DVD, kitchen, apparel, camera, health, music, toys, video, baby, magazine, software, and sports), plus two movie review domains from the IMDb and MR datasets. Each domain has a development set of 200 samples and a test set of 400 samples. The amounts of training and unlabeled data vary across domains but are roughly 1400 and 2000, respectively.
We compare MAN with ASP-MTL Liu et al. (2017) on the FDU-MTL dataset. ASP-MTL also adopts adversarial training for learning a shared feature space, and can be viewed as a special case of MAN adopting the NLL loss (MAN-NLL). Furthermore, while \newcite{P17-1001} do not provide any theoretical justification, in §3 we prove the validity of MAN not only for the NLL loss but also for an additional L2 loss. Beyond this theoretical advantage, we show in this section that MAN also substantially outperforms ASP-MTL in practice, owing to the choice of feature extractor.
In particular, \newcite{P17-1001} choose LSTM as their feature extractor, yet we found CNN Kim (2014) to achieve much better accuracy while running an order of magnitude faster. Indeed, as shown in Table 3, with or without adversarial training, our CNN models outperform the LSTM ones by a large margin. When MAN is introduced, we attain state-of-the-art performance on every domain, with an 88.4% overall accuracy, surpassing ASP-MTL by a significant margin of 2.3%.
We hypothesize that LSTM performs much worse than CNN here because of the lack of an attention mechanism: in ASP-MTL, only the last hidden unit is taken as the extracted features. While LSTM is effective at representing the context of each token, it might not be powerful enough for directly encoding an entire document Bahdanau et al. (2015). Various attention mechanisms have therefore been introduced on top of the vanilla LSTM to select the words (and contexts) most relevant for making predictions. In our preliminary experiments, we found that a bidirectional LSTM with dot-product attention Luong et al. (2015) yields better performance than the vanilla LSTM in ASP-MTL. However, it still does not outperform CNN, and is much slower. We therefore conclude that, for text classification tasks, CNN is both effective and efficient at extracting local and higher-level features for making a single categorization.
Finally, we observe that MAN-NLL achieves slightly higher overall performance than MAN-L2, providing evidence for the claim in a recent study Lucic et al. (2017) that the original GAN loss (NLL) is not inherently inferior. Moreover, the two variants excel in different domains, suggesting the possibility of further performance gains from using an ensemble.
5 Related Work
MultiDomain Text Classification
The MDTC task was first examined by \newcite{P08-2065}, who proposed to fuse the training data from multiple domains either at the feature level or at the classifier level. The prior art in MDTC Wu and Huang (2015) decomposes the text classifier into a general one and a set of domain-specific ones. However, the general classifier is learned only through parameter sharing, so domain-specific knowledge may sneak into it. Their approach also requires external resources to help improve accuracy and to compute domain similarities.
Domain Adaptation
Domain adaptation attempts to transfer knowledge from a source domain to a target one; the traditional form is single-source, single-target (SS,ST) adaptation Blitzer et al. (2006). Another variant is SS,MT adaptation Yang and Eisenstein (2015), which tries to simultaneously transfer knowledge to multiple target domains from a single source; however, it cannot fully take advantage of the training data when that data comes from multiple source domains. MS,ST adaptation Mansour et al. (2009); Zhao et al. (2017) can deal with multiple source domains but only transfers to a single target domain. When multiple target domains exist, these methods must therefore treat them as independent problems, which is more expensive and cannot utilize the additional unlabeled data in those domains. Finally, MDTC can be viewed as MS,MT adaptation, which is arguably more general and realistic.
Adversarial Networks
The idea of adversarial networks was proposed by \newcite{NIPS2014_5423} for image generation, and has since been applied to various NLP tasks as well Chen et al. (2016); Li et al. (2017). \newcite{ganin2016domain} first used it for SS,ST domain adaptation, followed by many others. \newcite{NIPS2016_6254} utilized adversarial training in a shared-private model for domain adaptation to learn domain-invariant features, but still focused on the SS,ST setting. Finally, the idea of using adversarial nets to discriminate over multiple distributions was empirically explored in a very recent work Liu et al. (2017) under the multi-task learning setting, which can be considered a special case of our MAN framework with the NLL domain loss. We instead propose a more general framework with alternative architectures for the adversarial component, and for the first time provide theoretical justifications for multinomial adversarial nets. Moreover, \newcite{P17-1001} used an LSTM without attention as their feature extractor, which we found to perform suboptimally in our experiments. We instead chose Convolutional Neural Nets as our feature extractor, which achieve higher accuracy while running an order of magnitude faster (see §4.3).
6 Conclusion
In this work, we propose a family of Multinomial Adversarial Networks (MANs) that generalize traditional binomial adversarial nets in the sense that a MAN can simultaneously minimize the difference among multiple probability distributions rather than just two. We provide theoretical justifications for two instances of MAN, MAN-NLL and MAN-L2, showing that they respectively minimize two different f-divergence metrics among multiple distributions. This indicates that MAN can be used to make multiple distributions indistinguishable from one another, and it can hence be applied to a variety of tasks, much like the versatile binomial adversarial nets, which have been used in many areas to make two distributions alike.
In this paper we design a MAN model for the MDTC task, following the shared-private paradigm: a shared feature extractor learns domain-invariant features, while domain feature extractors learn domain-specific ones. MAN is used to enforce that the shared feature extractor learns only domain-invariant knowledge, by resorting to MAN's power of making the shared feature distributions of samples from each domain indistinguishable. We conduct extensive experiments demonstrating that our MAN model outperforms the prior art in MDTC, and that it achieves state-of-the-art performance on domains without labeled data when compared to multi-source domain adaptation methods.
References
 Ali and Silvey (1966) S. M. Ali and S. D. Silvey. 1966. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological) 28(1):131–142. http://www.jstor.org/stable/2984279.
 Aslam and Pavlu (2007) Javed A. Aslam and Virgil Pavlu. 2007. Query Hardness Estimation Using Jensen-Shannon Divergence Among Multiple Scoring Functions, Springer Berlin Heidelberg, Berlin, Heidelberg, pages 198–209. https://doi.org/10.1007/9783540714965_20.
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR 2015). http://arxiv.org/abs/1409.0473.
 Blitzer et al. (2007) John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, pages 440–447. http://aclanthology.coli.unisaarland.de/pdf/P/P07/P071056.pdf.
 Blitzer et al. (2006) John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. http://www.aclweb.org/anthology/W061615.
 Bousmalis et al. (2016) Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. 2016. Domain separation networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, Curran Associates, Inc., pages 343–351. http://papers.nips.cc/paper/6254domainseparationnetworks.pdf.
 Chen et al. (2012) Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. 2012. Marginalized denoising autoencoders for domain adaptation. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML12), ACM, New York, NY, USA, ICML ’12, pages 767–774.
 Chen et al. (2016) Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. 2016. Adversarial Deep Averaging Networks for CrossLingual Sentiment Classification. ArXiv eprints https://arxiv.org/abs/1606.01614.
 Csiszar and Korner (1982) Imre Csiszar and Janos Korner. 1982. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, Inc., Orlando, FL, USA.
Evgeniou and Pontil (2004) Theodoros Evgeniou and Massimiliano Pontil. 2004. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, KDD ’04, pages 109–117. https://doi.org/10.1145/1014052.1014067.
Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17(59):1–35.
Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, Curran Associates, Inc., pages 2672–2680. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
 Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning. pages 448–456. http://jmlr.org/proceedings/papers/v37/ioffe15.pdf.
Iyyer et al. (2015) Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 1681–1691. https://doi.org/10.3115/v1/P15-1162.
Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 1746–1751. https://doi.org/10.3115/v1/D14-1181.
 Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations. https://arxiv.org/abs/1412.6980.
Li et al. (2017) Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2157–2169. http://www.aclweb.org/anthology/D17-1230.
Li and Zong (2008) Shoushan Li and Chengqing Zong. 2008. Multi-domain sentiment classification. In Proceedings of ACL-08: HLT, Short Papers. Association for Computational Linguistics, pages 257–260. http://aclanthology.coli.uni-saarland.de/pdf/P/P08/P08-2065.pdf.
Lin (1991) J. Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37(1):145–151. https://doi.org/10.1109/18.61115.
Liu et al. (2017) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 1–10. https://doi.org/10.18653/v1/P17-1001.
Lucic et al. (2017) M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. 2017. Are GANs Created Equal? A Large-Scale Study. ArXiv e-prints.
Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1412–1421. https://doi.org/10.18653/v1/D15-1166.
Mansour et al. (2009) Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. 2009. Domain adaptation with multiple sources. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, Curran Associates, Inc., pages 1041–1048. http://papers.nips.cc/paper/3550-domain-adaptation-with-multiple-sources.pdf.
 Mao et al. (2017) Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In The IEEE International Conference on Computer Vision (ICCV).
McCallum et al. (1998) Andrew McCallum, Kamal Nigam, et al. 1998. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization. Madison, WI, volume 752, pages 41–48.
 Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 .
Nielsen and Nock (2014) Frank Nielsen and Richard Nock. 2014. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters 21(1):10–13.
Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems. pages 271–279.
Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. NIPS 2017 Autodiff Workshop.
Wu and Huang (2015) F. Wu and Y. Huang. 2015. Collaborative multi-domain sentiment classification. In 2015 IEEE International Conference on Data Mining. pages 459–468. https://doi.org/10.1109/ICDM.2015.68.
Yang and Eisenstein (2015) Yi Yang and Jacob Eisenstein. 2015. Unsupervised multi-domain adaptation with feature embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 672–682. https://doi.org/10.3115/v1/N15-1069.
 Zhao et al. (2017) Han Zhao, Shanghang Zhang, Guanhang Wu, João P. Costeira, José M. F. Moura, and Geoffrey J. Gordon. 2017. Multiple source domain adaptation with adversarial training of neural networks. CoRR abs/1705.09684. http://arxiv.org/abs/1705.09684.
Zhou et al. (2011) J. Zhou, J. Chen, and J. Ye. 2011. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University. http://www.public.asu.edu/~jye02/Software/MALSAR.
Appendix A Proofs
A.1 Proofs for MAN-NLL
Assume we have $N$ domains. Consider the distribution of the shared features for instances in each domain $d_i$:

$$P_i(\mathbf{f}) \triangleq P\left(\mathbf{f} = \mathcal{F}_s(x) \mid x \in d_i\right)$$

The objective that $\mathcal{D}$ attempts to minimize is:

$$\mathcal{L}_{\mathcal{D}} = -\sum_{i=1}^{N} \mathbb{E}_{\mathbf{f} \sim P_i}\left[\log \mathcal{D}_i(\mathbf{f})\right] \tag{12}$$

where $\mathcal{D}_i(\mathbf{f})$ is the $i$-th dimension of $\mathcal{D}$'s output vector, which conceptually corresponds to the softmax probability of predicting that $\mathbf{f}$ is from domain $d_i$. We therefore have the property that, for any $\mathbf{f}$:

$$\sum_{i=1}^{N} \mathcal{D}_i(\mathbf{f}) = 1 \tag{13}$$
Lemma 2.
For any fixed $\mathcal{F}_s$, the optimum domain discriminator $\mathcal{D}^*$ is:

$$\mathcal{D}_i^*(\mathbf{f}) = \frac{P_i(\mathbf{f})}{\sum_{j=1}^{N} P_j(\mathbf{f})} \tag{14}$$
Proof.
For a fixed $\mathcal{F}_s$, the optimum

$$\mathcal{D}^* = \operatorname*{arg\,min}_{\mathcal{D}} \mathcal{L}_{\mathcal{D}} = \operatorname*{arg\,max}_{\mathcal{D}} \sum_{i=1}^{N} \int_{\mathbf{f}} P_i(\mathbf{f}) \log \mathcal{D}_i(\mathbf{f}) \, d\mathbf{f}$$

We employ the Lagrangian multiplier method to derive $\mathcal{D}^*$ under the constraint of (13). Let

$$L(\mathcal{D}_1, \ldots, \mathcal{D}_N, \lambda) = \sum_{i=1}^{N} P_i(\mathbf{f}) \log \mathcal{D}_i(\mathbf{f}) - \lambda \left( \sum_{i=1}^{N} \mathcal{D}_i(\mathbf{f}) - 1 \right)$$

Let $\nabla L = 0$:

$$\frac{\partial L}{\partial \mathcal{D}_i(\mathbf{f})} = \frac{P_i(\mathbf{f})}{\mathcal{D}_i(\mathbf{f})} - \lambda = 0 \qquad \text{and} \qquad \sum_{i=1}^{N} \mathcal{D}_i(\mathbf{f}) = 1$$

Solving the two equations, we have $\lambda = \sum_{j=1}^{N} P_j(\mathbf{f})$ and:

$$\mathcal{D}_i^*(\mathbf{f}) = \frac{P_i(\mathbf{f})}{\sum_{j=1}^{N} P_j(\mathbf{f})}$$
∎
On the other hand, the loss function of the shared feature extractor $\mathcal{F}_s$ consists of two additive components: the loss from the text classifier $\mathcal{C}$, and the loss from the domain discriminator $\mathcal{D}$:

$$\mathcal{L}_{\mathcal{F}_s} = \mathcal{L}_{\mathcal{F}_s}^{\mathcal{C}} + \lambda \mathcal{L}_{\mathcal{F}_s}^{\mathcal{D}}, \qquad \text{where for MAN-NLL } \mathcal{L}_{\mathcal{F}_s}^{\mathcal{D}} = -\mathcal{L}_{\mathcal{D}} \tag{15}$$
We have the following theorem for the domain loss of $\mathcal{F}_s$:
Theorem 3.
When $\mathcal{D}$ is trained to its optimality:

$$\mathcal{L}_{\mathcal{F}_s}^{\mathcal{D}} = -N \log N + N \cdot \mathrm{JSD}(P_1, \ldots, P_N) \tag{16}$$

where $\mathrm{JSD}(\cdot)$ is the generalized Jensen-Shannon divergence Lin (1991) among multiple distributions.
Proof.
Let $\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$.
There are two equivalent definitions of the generalized Jensen-Shannon divergence: the original definition based on Shannon entropy Lin (1991), and a reshaped one expressed as the average Kullback-Leibler divergence of each $P_i$ to the centroid $\bar{P}$ Aslam and Pavlu (2007). We adopt the latter here:

$$\mathrm{JSD}(P_1, \ldots, P_N) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}(P_i \,\|\, \bar{P}) \tag{17}$$

Now substituting $\mathcal{D}^*$ (14) into $\mathcal{L}_{\mathcal{F}_s}^{\mathcal{D}}$:

$$\begin{aligned}
\mathcal{L}_{\mathcal{F}_s}^{\mathcal{D}} &= \sum_{i=1}^{N} \mathbb{E}_{\mathbf{f} \sim P_i}\left[\log \mathcal{D}_i^*(\mathbf{f})\right]
= \sum_{i=1}^{N} \mathbb{E}_{\mathbf{f} \sim P_i}\left[\log \frac{P_i(\mathbf{f})}{N \bar{P}(\mathbf{f})}\right] \\
&= \sum_{i=1}^{N} \left( \mathrm{KL}(P_i \,\|\, \bar{P}) - \log N \right)
= -N \log N + N \cdot \mathrm{JSD}(P_1, \ldots, P_N)
\end{aligned}$$
∎
Consequently, by the non-negativity of the generalized Jensen-Shannon divergence Lin (1991), we have the following corollary:
Corollary 2.
The optimum of $\mathcal{L}_{\mathcal{F}_s}^{\mathcal{D}}$ is $-N \log N$, and it is achieved if and only if $P_1 = P_2 = \cdots = P_N$.
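The identity in Theorem 3 can be checked numerically on discrete distributions. The following sketch is not part of the original paper; it uses only NumPy, with arbitrary random distributions standing in for the feature distributions $P_i$, and plugs the optimal discriminator of Lemma 2 into the domain loss to compare against Eq. (16):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 4, 10                       # N domains, features supported on K discrete points
P = rng.random((N, K))
P /= P.sum(axis=1, keepdims=True)  # P[i] is the shared-feature distribution of domain i

# Lemma 2: the optimal discriminator D*_i(f) = P_i(f) / sum_j P_j(f)
D_star = P / P.sum(axis=0, keepdims=True)

# Domain loss of F_s at D*:  sum_i E_{f~P_i}[ log D*_i(f) ]
L_Fs_D = sum((P[i] * np.log(D_star[i])).sum() for i in range(N))

# Generalized JSD as the average KL divergence to the centroid (Eq. 17)
P_bar = P.mean(axis=0)
jsd = np.mean([(P[i] * np.log(P[i] / P_bar)).sum() for i in range(N)])

# Theorem 3:  L_Fs_D = -N log N + N * JSD(P_1, ..., P_N)
assert np.isclose(L_Fs_D, -N * np.log(N) + N * jsd)
```

Since the JSD term vanishes only when all $P_i$ coincide, rerunning the same script with identical rows of `P` recovers the optimum $-N \log N$ of Corollary 2.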
A.2 Proofs for MAN-L2
The proof is similar for MAN with the L2 loss. The loss function used by $\mathcal{D}$ is, for a sample from domain $d_i$ with shared feature vector $\mathbf{f}$:

$$\mathcal{L}_{\mathcal{D}}(\mathcal{D}(\mathbf{f}), i) = \sum_{j=1}^{N} \left( \mathcal{D}_j(\mathbf{f}) - \mathbb{1}_{\{i=j\}} \right)^2 \tag{18}$$

So the objective that $\mathcal{D}$ minimizes is:

$$\mathcal{L}_{\mathcal{D}} = \sum_{i=1}^{N} \mathbb{E}_{\mathbf{f} \sim P_i}\left[ \sum_{j=1}^{N} \left( \mathcal{D}_j(\mathbf{f}) - \mathbb{1}_{\{i=j\}} \right)^2 \right] \tag{19}$$

For simplicity, we further constrain $\mathcal{D}$'s outputs to be on a simplex:

$$\sum_{j=1}^{N} \mathcal{D}_j(\mathbf{f}) = 1 \tag{20}$$
Lemma 3.
For any fixed $\mathcal{F}_s$, the optimum domain discriminator $\mathcal{D}^*$ is:

$$\mathcal{D}_i^*(\mathbf{f}) = \frac{P_i(\mathbf{f})}{\sum_{j=1}^{N} P_j(\mathbf{f})} \tag{21}$$
Proof.
For a fixed $\mathcal{F}_s$, the optimum

$$\mathcal{D}^* = \operatorname*{arg\,min}_{\mathcal{D}} \mathcal{L}_{\mathcal{D}} = \operatorname*{arg\,min}_{\mathcal{D}} \sum_{i=1}^{N} \int_{\mathbf{f}} P_i(\mathbf{f}) \sum_{j=1}^{N} \left( \mathcal{D}_j(\mathbf{f}) - \mathbb{1}_{\{i=j\}} \right)^2 d\mathbf{f}$$

Similar to MAN-NLL, we employ the Lagrangian multiplier method to derive $\mathcal{D}^*$ under the constraint of (20). Let $\nabla L = 0$:

$$\frac{\partial L}{\partial \mathcal{D}_j(\mathbf{f})} = 2 \mathcal{D}_j(\mathbf{f}) \sum_{i=1}^{N} P_i(\mathbf{f}) - 2 P_j(\mathbf{f}) - \lambda = 0 \qquad \text{and} \qquad \sum_{j=1}^{N} \mathcal{D}_j(\mathbf{f}) = 1$$

Solving the two equations, we have $\lambda = 0$ and:

$$\mathcal{D}_i^*(\mathbf{f}) = \frac{P_i(\mathbf{f})}{\sum_{j=1}^{N} P_j(\mathbf{f})}$$
∎
For the domain loss of $\mathcal{F}_s$:
Theorem 4.
Let $\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$. When $\mathcal{D}$ is trained to its optimality:

$$\mathcal{L}_{\mathcal{F}_s}^{\mathcal{D}} = \frac{1}{N} \sum_{i=1}^{N} \chi^2_N(P_i \,\|\, \bar{P}) \tag{22}$$

where $\chi^2_N(\cdot \,\|\, \cdot)$ is the Neyman $\chi^2$ divergence Nielsen and Nock (2014).
Proof.
Substituting $\mathcal{D}^*$ (21) into $\mathcal{L}_{\mathcal{F}_s}^{\mathcal{D}} = \sum_{i=1}^{N} \mathbb{E}_{\mathbf{f} \sim P_i}\left[ \sum_{j=1}^{N} \left( \mathcal{D}_j(\mathbf{f}) - \frac{1}{N} \right)^2 \right]$:

$$\begin{aligned}
\mathcal{L}_{\mathcal{F}_s}^{\mathcal{D}} &= \sum_{i=1}^{N} \int_{\mathbf{f}} P_i(\mathbf{f}) \sum_{j=1}^{N} \left( \frac{P_j(\mathbf{f}) - \bar{P}(\mathbf{f})}{N \bar{P}(\mathbf{f})} \right)^2 d\mathbf{f}
= \int_{\mathbf{f}} N \bar{P}(\mathbf{f}) \sum_{j=1}^{N} \frac{\left( P_j(\mathbf{f}) - \bar{P}(\mathbf{f}) \right)^2}{N^2 \bar{P}(\mathbf{f})^2} \, d\mathbf{f} \\
&= \frac{1}{N} \sum_{j=1}^{N} \int_{\mathbf{f}} \frac{\left( P_j(\mathbf{f}) - \bar{P}(\mathbf{f}) \right)^2}{\bar{P}(\mathbf{f})} \, d\mathbf{f}
= \frac{1}{N} \sum_{j=1}^{N} \chi^2_N(P_j \,\|\, \bar{P})
\end{aligned}$$
∎
Finally, by the joint convexity of f-divergences, we have the following corollary:
Corollary 3.
The optimum of $\mathcal{L}_{\mathcal{F}_s}^{\mathcal{D}}$ is $0$, and the equality is attained if and only if $P_1 = P_2 = \cdots = P_N = \bar{P}$.
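Theorem 4 admits the same kind of numerical sanity check as the MAN-NLL case. The sketch below (not from the paper; NumPy only, random discrete distributions) plugs the optimal discriminator of Eq. (21) into the L2 domain loss and compares against the averaged Neyman $\chi^2$ divergence to the centroid:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 3, 8                         # N domains, K-point discrete feature space
P = rng.random((N, K))
P /= P.sum(axis=1, keepdims=True)   # P[i] is the feature distribution of domain i

D_star = P / P.sum(axis=0, keepdims=True)  # optimal discriminator (Eq. 21)
P_bar = P.mean(axis=0)                     # the centroid distribution

# L2 domain loss of F_s at D*:  sum_i E_{f~P_i}[ sum_j (D*_j(f) - 1/N)^2 ]
L_Fs_D = sum((P[i] * ((D_star - 1.0 / N) ** 2).sum(axis=0)).sum() for i in range(N))

# Neyman chi-square divergence of each P_i from the centroid
chi2 = [((P[i] - P_bar) ** 2 / P_bar).sum() for i in range(N)]

# Theorem 4:  L_Fs_D = (1/N) * sum_i chi2(P_i || P_bar)
assert np.isclose(L_Fs_D, np.mean(chi2))
```

The loss is strictly positive here because the random $P_i$ differ; it drops to the optimum $0$ of Corollary 3 exactly when all rows of `P` are equal.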
Appendix B Implementation Details
For all three of our experiments, we use $\lambda = 0.05$ and $k = 5$ (see Algorithm 1). For both optimizers, Adam Kingma and Ba (2015) is used with learning rate $0.0001$. The size of the shared feature vector is set to $128$ while that of the domain feature vector is $64$. Dropout of $p = 0.4$ is used in all components. $\mathcal{C}$ and $\mathcal{D}$ each has one hidden layer of the same size as its input ($128 + 64 = 192$ for $\mathcal{C}$ and $128$ for $\mathcal{D}$). ReLU is used as the activation function. Batch normalization Ioffe and Szegedy (2015) is used in both $\mathcal{C}$ and $\mathcal{D}$ but not $\mathcal{F}$. We use a batch size of 8.
For our first two experiments on the Amazon review dataset, the MLP feature extractor is used. As described in the paper, it has an input size of 5000. Two hidden layers are used, with sizes $1000$ and $500$, respectively.
For the CNN feature extractor used in the FDU-MTL experiment, a single convolution layer is used. The kernel sizes are 3, 4, and 5, and the number of kernels is 200. The convolution layer takes as input the 100d word embeddings of each word in the input sequence. We use word2vec word embeddings Mikolov et al. (2013) trained on a collection of unlabeled raw Amazon reviews Blitzer et al. (2007). After convolution, the outputs pass through a ReLU layer before being fed into a max-pooling layer. The pooled output is then fed into a single fully connected layer to be converted into a feature vector of size either 128 or 64. More details on using CNNs for text classification can be found in the original paper Kim (2014). MAN is implemented using PyTorch Paszke et al. (2017).
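To make the component sizes concrete, here is a minimal NumPy shape sketch of the MLP extractors and the classifier input (this is not the released PyTorch implementation; the intermediate hidden widths of 1000 and 500 are illustrative assumptions, and random weights stand in for trained parameters, so only the tensor shapes are meaningful):

```python
import numpy as np

def relu_mlp(x, sizes, rng):
    """Forward pass through a randomly initialized ReLU MLP (illustrates shapes only)."""
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        x = np.maximum(x @ (rng.standard_normal((d_in, d_out)) * 0.01), 0.0)
    return x

rng = np.random.default_rng(0)
batch = 8                          # batch size from the text
x = rng.random((batch, 5000))      # 5000-d bag-of-words input (Amazon experiments)

# Shared extractor F_s -> 128-d features; one domain extractor F_d -> 64-d features.
# The hidden widths (1000, 500) are assumptions for illustration.
f_shared = relu_mlp(x, [5000, 1000, 500, 128], rng)
f_domain = relu_mlp(x, [5000, 1000, 500, 64], rng)

# The text classifier C sees the concatenated shared + domain features (192-d),
# while the domain discriminator D sees only the shared features (128-d).
c_input = np.concatenate([f_shared, f_domain], axis=1)

assert f_shared.shape == (batch, 128)
assert c_input.shape == (batch, 192)
```

This also makes explicit why the hidden layers of $\mathcal{C}$ and $\mathcal{D}$ described above have widths 192 and 128, respectively: each matches the dimensionality of its own input.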