Multinomial Adversarial Networks for Multi-Domain Text Classification

Xilun Chen
Department of Computer Science
Cornell University
Ithaca, NY, 14853, USA
xlchen@cs.cornell.edu
&Claire Cardie
Department of Computer Science
Cornell University
Ithaca, NY, 14853, USA
cardie@cs.cornell.edu
Abstract

Many text classification tasks are known to be highly domain-dependent. Unfortunately, the availability of training data can vary drastically across domains. Worse still, for some domains there may not be any annotated data at all. In this work, we propose a multinomial adversarial network (MAN) to tackle this real-world multi-domain text classification (MDTC) problem. We provide theoretical justifications for the MAN framework, proving that different instances of MANs are essentially minimizers of various f-divergence metrics Ali and Silvey (1966) among multiple probability distributions. MANs are thus a theoretically sound generalization of traditional adversarial networks that discriminate over two distributions. More specifically, for the MDTC task, MAN learns features that are invariant across multiple domains by resorting to its ability to reduce the divergence among the feature distributions of each domain. We present experimental results showing that MANs significantly outperform the prior art on the MDTC task. We also show that MANs achieve state-of-the-art performance for domains with no labeled data.


The source code of MAN can be found at https://github.com/ccsasuke/man

1 Introduction

Text classification is one of the most fundamental tasks in Natural Language Processing, and has found its way into a wide spectrum of NLP applications, ranging from email spam detection and social media analytics to sentiment analysis and data mining. Over the past couple of decades, supervised statistical learning methods have become the dominant approach for text classification (e.g. McCallum and Nigam (1998); Kim (2014); Iyyer et al. (2015)). Unfortunately, many text classification tasks are highly domain-dependent, in that a text classifier trained using labeled data from one domain is likely to perform poorly on another. In the task of sentiment classification, for example, the phrase “runs fast” is usually associated with positive sentiment in the sports domain; not so when a user is reviewing the battery of an electronic device. In real applications, therefore, an adequate amount of training data from each domain of interest is typically required, and this is expensive to obtain.

Two major lines of work attempt to tackle this challenge: domain adaptation Blitzer et al. (2007) and multi-domain text classification (MDTC) Li and Zong (2008). In domain adaptation, the assumption is that there is some domain with abundant training data (the source domain), and the goal is to utilize knowledge learned from the source domain to help perform classification on another lower-resourced target domain (see §5 for other variants of domain adaptation). The focus of this work, MDTC, instead simulates an arguably more realistic scenario, where labeled data may exist for multiple domains, but in insufficient amounts to train an effective classifier for one or more of the domains. Worse still, some domains may have no labeled data at all. The objective of MDTC is to leverage all the available resources in order to improve system performance over all domains simultaneously.

One state-of-the-art system for MDTC, the CMSC of Wu and Huang (2015), combines a classifier that is shared across all domains (for learning domain-invariant knowledge) with a set of classifiers, one per domain, each of which captures domain-specific text classification knowledge. This paradigm is sometimes known as the Shared-Private model Bousmalis et al. (2016). CMSC, however, lacks an explicit mechanism to ensure that the shared classifier captures only domain-independent knowledge: the shared classifier may well also acquire some domain-specific features that are useful for a subset of the domains. We hypothesize that better performance could be obtained if this constraint were explicitly enforced.

In this paper, we thus propose Multinomial Adversarial Networks (henceforth, MANs) for the task of multi-domain text classification. In contrast to standard adversarial networks Goodfellow et al. (2014), which serve as a tool for minimizing the divergence between two distributions Nowozin et al. (2016), MANs represent a family of theoretically sound adversarial networks that leverage a multinomial discriminator to directly minimize the divergence among multiple probability distributions. And just as binomial adversarial networks have been applied to numerous tasks (e.g. image generation Goodfellow et al. (2014), domain adaptation Ganin et al. (2016), cross-lingual sentiment analysis Chen et al. (2016)), we anticipate that MANs will make a versatile machine learning framework with applications beyond the MDTC task studied in this work.

We introduce the MAN architecture in §2 and prove in §3 that it directly minimizes the (generalized) f-divergence among multiple distributions, so that they become indistinguishable upon successful training. Specifically for MDTC, MAN is used to overcome the aforementioned limitation of prior art, where domain-specific features may sneak into the shared model, by relying on MAN’s ability to minimize the divergence among the feature distributions of each domain. The high-level idea is that MAN makes the extracted feature distributions of each domain indistinguishable from one another, thus learning general features that are invariant across domains.

We then validate the effectiveness of MAN in experiments on two MDTC data sets. We find first that MAN significantly outperforms the state-of-the-art CMSC method Wu and Huang (2015) on the widely used multi-domain Amazon review dataset, and does so without relying on external resources such as sentiment lexica (§4.1). When applied to the FDU-MTL dataset (§4.3), we obtain similar results: MAN achieves substantially higher accuracy than the previous top-performing method, ASP-MTL Liu et al. (2017). ASP-MTL represents the first empirical attempt to use a multinomial adversarial network, proposed for a multi-task learning setting, but it is more restricted and can be viewed as a special case of MAN. In addition, we for the first time provide theoretical guarantees for MAN (§3) that were absent in ASP-MTL. Finally, while many MDTC methods such as CMSC require labeled data for each domain, MANs can be applied in cases where no labeled data exists for a subset of domains. To evaluate MAN in this semi-supervised setting, we compare MAN to a method that can accommodate unlabeled data for (only) one domain Zhao et al. (2017), and show that MAN achieves performance comparable to the state of the art (§4.2).

2 Model

In this paper, we strive to tackle the text classification problem in a real-world setting in which texts come from a variety of domains, each with a varying amount of labeled data. Specifically, assume we have a set of labeled domains $\Delta_L$ for which there is some labeled data, and a set of unlabeled domains $\Delta_U$ for which no annotated training instances are available. Denote $\Delta = \Delta_L \cup \Delta_U$ as the collection of all domains, with $N = |\Delta|$ being the total number of domains we are faced with. The goal of this work, and MDTC in general, is to improve the overall classification performance across all domains, measured in this paper as the average classification accuracy across the $N$ domains in $\Delta$.

2.1 Model Architecture

Figure 1: MAN for MDTC. The figure demonstrates the training on a mini-batch of data from one domain. One training iteration consists of one such mini-batch training from each domain. The parameters of $\mathcal{F}_s$, $\mathcal{F}_d$, and $\mathcal{C}$ are updated together, and the training flows are illustrated by the green arrows. The parameters of $\mathcal{D}$ are updated separately, shown in red arrows. Solid lines indicate forward passes while dotted lines are backward passes. $J_{\mathcal{F}_s}^D$ is the domain loss of $\mathcal{F}_s$, which is anticorrelated with $J_\mathcal{D}$ (e.g., $J_{\mathcal{F}_s}^D = -J_\mathcal{D}$). (See §2 and §3.)

As shown in Figure 1, the Multinomial Adversarial Network (MAN) adopts the Shared-Private paradigm of Bousmalis et al. (2016) and consists of four components: a shared feature extractor $\mathcal{F}_s$, a domain feature extractor $\mathcal{F}_d^i$ for each labeled domain $i \in \Delta_L$, a text classifier $\mathcal{C}$, and finally a domain discriminator $\mathcal{D}$. The main idea of MAN is to explicitly model both the domain-invariant features that are beneficial to the main classification task across all domains (i.e. shared features, extracted by $\mathcal{F}_s$) and the domain-specific features that mainly contribute to the classification in their own domain (domain features, extracted by $\mathcal{F}_d^i$). Here, the adversarial domain discriminator $\mathcal{D}$ has a multinomial output: it takes a shared feature vector and predicts the likelihood of that sample coming from each domain. As seen in Figure 1, during the training flow of $\mathcal{F}_s$ (green arrows), $\mathcal{F}_s$ aims to confuse $\mathcal{D}$ by minimizing the domain loss $J_{\mathcal{F}_s}^D$, which is anticorrelated with $J_\mathcal{D}$ (detailed in §2.2), so that $\mathcal{D}$ cannot predict the domain of a sample given its shared features. The intuition is that if even a strong discriminator cannot tell the domain of a sample from the extracted features, those features are essentially domain invariant. By enforcing that domain-invariant features be learned by $\mathcal{F}_s$, the set of domain feature extractors $\mathcal{F}_d^i$, when trained jointly via backpropagation, will each learn domain-specific features beneficial within its own domain.

The architecture of each component is relatively flexible and can be chosen by the practitioner to suit the particular classification task. For instance, the feature extractors can adopt the form of Convolutional Neural Nets (CNN), Recurrent Neural Nets (RNN), or a Multi-Layer Perceptron (MLP), depending on the input data (see §4). The input of MAN will likewise depend on the feature extractor choice. The output of a (shared/domain) feature extractor is a fixed-length vector, which is considered the (shared/domain) hidden features of a given input text. On the other hand, the outputs of $\mathcal{C}$ and $\mathcal{D}$ are label probabilities for class and domain prediction, respectively. For example, both $\mathcal{C}$ and $\mathcal{D}$ can be MLPs with a softmax layer on top. In §3, we provide alternative architectures for $\mathcal{D}$ and their mathematical implications. We now present detailed descriptions of the MAN training in §2.2 as well as the theoretical grounds in §3.
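To make the architecture concrete, the following is a minimal PyTorch sketch of the four components for the MLP instantiation (cf. §4.1 and Appendix B). The layer sizes and depths here are illustrative assumptions, not our exact configuration.

    import torch
    import torch.nn as nn

    class MLPFeatureExtractor(nn.Module):
        """F_s or F_d^i: maps a bag-of-features vector to a fixed-length feature vector."""
        def __init__(self, input_size=5000, hidden_size=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_size, hidden_size), nn.ReLU(),
                nn.Linear(hidden_size, hidden_size), nn.ReLU())
        def forward(self, x):
            return self.net(x)

    class TextClassifier(nn.Module):
        """C: takes [shared; domain] features, outputs log label probabilities."""
        def __init__(self, feature_size, num_labels=2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feature_size, feature_size), nn.ReLU(),
                nn.Linear(feature_size, num_labels), nn.LogSoftmax(dim=-1))
        def forward(self, shared_feat, domain_feat):
            return self.net(torch.cat([shared_feat, domain_feat], dim=-1))

    class DomainDiscriminator(nn.Module):
        """D: takes shared features only, outputs a multinomial over the N domains."""
        def __init__(self, feature_size, num_domains):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feature_size, feature_size), nn.ReLU(),
                nn.Linear(feature_size, num_domains), nn.LogSoftmax(dim=-1))
        def forward(self, shared_feat):
            return self.net(shared_feat)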

Require: labeled corpora {X_i : i ∈ Δ_L}; unlabeled corpora {U_i : i ∈ Δ}; hyperparameters λ > 0, k ∈ ℕ
1: repeat
2:     ▷ D iterations
3:     for diter = 1 to k do
4:         l_D = 0
5:         for all i ∈ Δ do                    ▷ For all N domains
6:             Sample a mini-batch x ∼ U_i
7:             f = F_s(x)                      ▷ Shared feature vector
8:             l_D += L_D(D(f); i)             ▷ Accumulate D loss
9:         Update D parameters using ∇l_D
10:    ▷ Main iteration
11:    loss = 0
12:    for all i ∈ Δ_L do                      ▷ For all labeled domains
13:        Sample a mini-batch (x, y) ∼ X_i
14:        f_s = F_s(x)
15:        f_d = F_d^i(x)                      ▷ Domain feature vector
16:        loss += L_C(C(f_s, f_d); y)         ▷ Compute C loss
17:    for all i ∈ Δ do                        ▷ For all N domains
18:        Sample a mini-batch x ∼ U_i
19:        loss += λ · L^D_{F_s}(D(F_s(x)))    ▷ Domain loss of F_s
20:    Update F_s, F_d, C parameters using ∇loss
21: until convergence
Algorithm 1 MAN Training
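Below is an illustrative PyTorch sketch of one training iteration of Algorithm 1 with the NLL variant of the domain loss, assuming the component classes sketched in §2.1. The helpers labeled_batch and unlabeled_batch are hypothetical, and the default values of lambd and k are placeholders rather than our tuned settings.

    import torch
    import torch.nn.functional as F

    # Assumed to exist: F_s, C, D from the sketch above; F_d is a dict of
    # per-domain extractors; labeled_batch(d) -> (x, y) and
    # unlabeled_batch(d) -> x are hypothetical mini-batch helpers.
    def train_iteration(F_s, F_d, C, D, opt_D, opt_main,
                        labeled_domains, all_domains, lambd=0.05, k=5):
        # --- D iterations: train the discriminator on all N domains ---
        for _ in range(k):
            loss_d = 0.0
            for i, d in enumerate(all_domains):
                x = unlabeled_batch(d)
                with torch.no_grad():          # F_s is held fixed while updating D
                    f_s = F_s(x)
                target = torch.full((x.size(0),), i, dtype=torch.long)
                loss_d = loss_d + F.nll_loss(D(f_s), target)   # L_D (NLL variant)
            opt_D.zero_grad(); loss_d.backward(); opt_D.step()

        # --- Main iteration: train F_s, F_d, C ---
        loss = 0.0
        for d in labeled_domains:              # classification loss J_C
            x, y = labeled_batch(d)
            loss = loss + F.nll_loss(C(F_s(x), F_d[d](x)), y)
        for i, d in enumerate(all_domains):    # domain loss of F_s (NLL: -J_D)
            x = unlabeled_batch(d)
            target = torch.full((x.size(0),), i, dtype=torch.long)
            loss = loss - lambd * F.nll_loss(D(F_s(x)), target)
        opt_main.zero_grad(); loss.backward(); opt_main.step()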

2.2 Training

Denote the annotated corpus in a labeled domain $i \in \Delta_L$ as $\mathbb{X}_i$; $(x, y) \sim \mathbb{X}_i$ is a sample drawn from the labeled data in domain $i$, where $x$ is the input and $y$ is the task label. On the other hand, for any domain $i \in \Delta$, denote the unlabeled corpus as $\mathbb{U}_i$. Note that for a labeled domain, one can use a separate unlabeled corpus or simply use the labeled data (or use both).

In Figure 1, the arrows illustrate the training flows of the various components. Due to the adversarial nature of the domain discriminator $\mathcal{D}$, it is trained with a separate optimizer (red arrows), while the rest of the networks are updated with the main optimizer (green arrows). $\mathcal{C}$ is only trained on labeled domains, and it takes as input the concatenation of the shared and domain feature vectors. At test time for unlabeled domains, which have no $\mathcal{F}_d$, the domain features are set to the $\mathbf{0}$ vector for $\mathcal{C}$'s input. On the contrary, $\mathcal{D}$ only takes the shared features as input, for both labeled and unlabeled domains. The MAN training procedure is described in Algorithm 1.

In Algorithm 1, $L_C$ and $L_D$ are the loss functions of the text classifier $\mathcal{C}$ and the domain discriminator $\mathcal{D}$, respectively. As mentioned in §2.1, $\mathcal{C}$ has a softmax layer on top for classification. We hence adopt the canonical negative log-likelihood (NLL) loss:

$$L_C(\hat{y}, y) = -\log P(\hat{y} = y) \qquad (1)$$

where $y$ is the true label and $\hat{y}$ is the prediction. For $\mathcal{D}$, we consider two variants of MAN. The first uses the same NLL loss as $\mathcal{C}$, which suits the classification task; the other option is the Least-Square (L2) loss, which was shown to alleviate the vanishing-gradient problem that the NLL loss suffers from in the adversarial setting Mao et al. (2017):

$$L_D^{NLL}(\hat{d}, d) = -\log P(\hat{d} = d) \qquad (2)$$
$$L_D^{L2}(\hat{d}, d) = \sum_{i=1}^{N} \left( \hat{d}_i - \mathbb{1}_{\{i=d\}} \right)^2 \qquad (3)$$

where $d$ is the domain index of a sample and $\hat{d}$ is the prediction. Without loss of generality, we normalize $\hat{d}$ so that $\hat{d}_i \ge 0$ and $\sum_{i=1}^{N} \hat{d}_i = 1$.

Therefore, the objectives of $\mathcal{C}$ and $\mathcal{D}$ that we are minimizing are:

$$J_\mathcal{C} = \sum_{i \in \Delta_L} \mathbb{E}_{(x,y) \sim \mathbb{X}_i} \left[ L_C\big(\mathcal{C}(\mathcal{F}_s(x), \mathcal{F}_d^i(x)), y\big) \right] \qquad (4)$$
$$J_\mathcal{D} = \sum_{i \in \Delta} \mathbb{E}_{x \sim \mathbb{U}_i} \left[ L_D\big(\mathcal{D}(\mathcal{F}_s(x)), i\big) \right] \qquad (5)$$

For the feature extractors, the training of the domain feature extractors is straightforward, as their sole objective is to help $\mathcal{C}$ perform better within their own domain. Hence, $J_{\mathcal{F}_d^i} = J_\mathcal{C}$ for any domain $i \in \Delta_L$. Finally, the shared feature extractor $\mathcal{F}_s$ has two objectives: to help $\mathcal{C}$ achieve higher accuracy, and to make the feature distribution invariant across all domains. It thus leads to the following bipartite loss:

$$J_{\mathcal{F}_s} = J_{\mathcal{F}_s}^C + \lambda J_{\mathcal{F}_s}^D$$

where $\lambda$ is a hyperparameter balancing the two parts, $J_{\mathcal{F}_s}^C = J_\mathcal{C}$, and $J_{\mathcal{F}_s}^D$ is the domain loss of $\mathcal{F}_s$, anticorrelated with $J_\mathcal{D}$:

$$J_{\mathcal{F}_s}^D = -J_\mathcal{D} \qquad (6)$$
$$J_{\mathcal{F}_s}^D = \sum_{i \in \Delta} \mathbb{E}_{x \sim \mathbb{U}_i} \left[ \sum_{j=1}^{N} \left( \mathcal{D}_j(\mathcal{F}_s(x)) - \frac{1}{N} \right)^2 \right] \qquad (7)$$

If $\mathcal{D}$ adopts the NLL loss, the domain loss is simply (6), the negation of $J_\mathcal{D}$. For the L2 loss, (7) intuitively translates to pushing $\mathcal{D}$ toward making random predictions (probability $1/N$ for every domain). See §3 for theoretical justifications.
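For concreteness, here is a hedged sketch of the two $J_{\mathcal{F}_s}^D$ variants, (6) and (7), computed from the discriminator's log-probability outputs (as in the component sketch of §2.1). The helper names are our own.

    import torch
    import torch.nn.functional as F

    def domain_loss_nll(d_log_probs, domain_idx):
        # Eq. (6): J^D_{F_s} = -J_D, i.e. the negated discriminator NLL loss.
        target = torch.full((d_log_probs.size(0),), domain_idx, dtype=torch.long)
        return -F.nll_loss(d_log_probs, target)

    def domain_loss_l2(d_log_probs):
        # Eq. (7): push D's (normalized) predictions toward the uniform 1/N,
        # i.e. make D unable to do better than random guessing.
        d_probs = d_log_probs.exp()
        n = d_probs.size(-1)
        return ((d_probs - 1.0 / n) ** 2).sum(dim=-1).mean()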

3 Theories of Multinomial Adversarial Networks

Binomial adversarial nets are known to have theoretical connections to the minimization of various f-divergences between two distributions Nowozin et al. (2016). For adversarial training among multiple distributions, however, although similar ideas have been explored empirically Liu et al. (2017), no theoretical justifications have been provided to the best of our knowledge.

In this section, we present a theoretical analysis showing the validity of MAN. In particular, we show that MAN’s objective is equivalent to minimizing the total f-divergence between each of the shared feature distributions of the domains, and the centroid of the distributions. The choice of loss function will determine which specific f-divergence is minimized. Furthermore, with adequate model capacity, MAN achieves its optimum for either loss function if and only if all shared feature distributions are identical, hence learning an invariant feature space across all domains.

First consider the distribution of the shared features $\mathbf{f} = \mathcal{F}_s(x)$ for instances in each domain $i$:

$$P_i(\mathbf{f}) \triangleq P(\mathbf{f} = \mathcal{F}_s(x) \mid x \in \text{domain } i) \qquad (8)$$

Combining (5) with the two loss functions (2) and (3), the objective of $\mathcal{D}$ can be written as:

$$J_\mathcal{D}^{NLL} = -\sum_{i=1}^{N} \mathbb{E}_{\mathbf{f} \sim P_i} \left[ \log \mathcal{D}_i(\mathbf{f}) \right] \qquad (9)$$
$$J_\mathcal{D}^{L2} = \sum_{i=1}^{N} \mathbb{E}_{\mathbf{f} \sim P_i} \left[ \sum_{j=1}^{N} \left( \mathcal{D}_j(\mathbf{f}) - \mathbb{1}_{\{i=j\}} \right)^2 \right] \qquad (10)$$

where $\mathcal{D}_i(\mathbf{f})$ is the $i$-th dimension of $\mathcal{D}$'s (normalized) output vector, which conceptually corresponds to the probability of $\mathcal{D}$ predicting that $\mathbf{f}$ is from domain $i$.

We first derive the optimal $\mathcal{D}$ for any fixed $\mathcal{F}_s$.

Lemma 1.

For any fixed $\mathcal{F}_s$, with either the NLL or the L2 loss, the optimum domain discriminator $\mathcal{D}^*$ is:

$$\mathcal{D}_i^*(\mathbf{f}) = \frac{P_i(\mathbf{f})}{\sum_{j=1}^{N} P_j(\mathbf{f})} \qquad (11)$$

The proof involves an application of the Lagrangian multiplier to solve for the minimum value of $J_\mathcal{D}$; the details can be found in the Appendix. We then have the following main theorems for the domain loss $J_{\mathcal{F}_s}^D$ of $\mathcal{F}_s$:

Theorem 1.

Let $\overline{P} = \frac{1}{N}\sum_{i=1}^{N} P_i$. When $\mathcal{D}$ is trained to its optimality, if $\mathcal{D}$ adopts the NLL loss:

$$J_{\mathcal{F}_s}^D = -J_\mathcal{D} = -N \log N + N \cdot \mathrm{JSD}(P_1, \ldots, P_N)$$

where $\mathrm{JSD}(\cdot)$ is the generalized Jensen-Shannon divergence Lin (1991) among multiple distributions, defined as the average Kullback-Leibler divergence of each $P_i$ to the centroid $\overline{P}$ Aslam and Pavlu (2007).

Theorem 2.

If $\mathcal{D}$ uses the L2 loss:

$$J_{\mathcal{F}_s}^D = \frac{1}{N} \sum_{i=1}^{N} \chi^2_N(P_i \,\|\, \overline{P})$$

where $\chi^2_N(\cdot \,\|\, \cdot)$ is the Neyman $\chi^2$ divergence Nielsen and Nock (2014). The proofs of both theorems can be found in the Appendix.

Consequently, by the non-negativity and joint convexity of the f-divergence Csiszar and Korner (1982), we have:

Corollary 1.

The optimum of $J_{\mathcal{F}_s}^D$ is $-N \log N$ when using the NLL loss, and $0$ for the L2 loss. The optimum value is achieved if and only if $P_1 = P_2 = \cdots = P_N = \overline{P}$, for either loss.

Therefore, the loss of $\mathcal{F}_s$ can be interpreted as simultaneously minimizing the classification loss and the divergence among the feature distributions of all domains. Upon successful training, $\mathcal{F}_s$ thus learns a shared feature mapping that is invariant across domains while remaining beneficial to the main classification task.
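As an illustrative numerical check of Lemma 1 and Corollary 1 (separate from the formal proofs in the Appendix), the following Python snippet evaluates $J_{\mathcal{F}_s}^D$ under the optimal discriminator (11) for discrete feature distributions; it is a sanity check on a toy setup of our own, not part of the paper's experiments.

    import numpy as np

    def nll_domain_loss_at_optimum(P):
        # P: (N, K) array of N discrete distributions over K feature values.
        # Optimal D from (11): D*_i(f) = P_i(f) / sum_j P_j(f).
        N = P.shape[0]
        D_star = P / P.sum(axis=0, keepdims=True)
        # J^D_{F_s} = -J_D = sum_i E_{f~P_i}[ log D*_i(f) ]
        return sum((P[i] * np.log(D_star[i])).sum() for i in range(N))

    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(10), size=3)                    # 3 distinct distributions
    print(nll_domain_loss_at_optimum(P))                      # > -3*log(3) ~ -3.296
    print(nll_domain_loss_at_optimum(np.tile(P[0], (3, 1))))  # = -3*log(3), the optimum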

4 Experiments

4.1 Multi-Domain Text Classification

In this experiment, we compare MAN to state-of-the-art MDTC systems on the multi-domain Amazon review dataset Blitzer et al. (2007), one of the most widely used MDTC datasets. Note that this dataset was already preprocessed into a bag of features (unigrams and bigrams), losing all word order information. This prohibits the use of CNNs or RNNs as feature extractors, limiting the potential performance of the system. Nonetheless, we adopt the same dataset for fair comparison and employ an MLP as our feature extractor. In particular, we take the 5000 most frequent features and represent each review as a 5000-dimensional feature vector, where feature values are raw counts of the features. Our MLP feature extractor therefore has an input size of 5000 in order to process the reviews.
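For illustration only, assuming raw text were available (the dataset itself ships already preprocessed), a count-based feature vector of this kind could be built as follows with sklearn:

    from sklearn.feature_extraction.text import CountVectorizer

    # Unigram + bigram raw counts, keeping the 5000 most frequent features,
    # matching the input representation described above.
    vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=5000)
    X = vectorizer.fit_transform(["runs fast and looks great",
                                  "battery runs fast, died in an hour"])
    print(X.shape)  # (2, num_features) -- up to 5000 columns on a real corpus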

The Amazon dataset contains 2000 samples for each of the four domains: book, DVD, electronics, and kitchen, with binary labels (positive, negative). Following Wu and Huang (2015), we conduct 5-fold cross validation: three of the five folds are treated as the training set, one serves as the validation set, and the remaining one is the test set. The 5-fold average test accuracy is reported.

 Book DVD Elec. Kit. Avg.
Domain-Specific Models Only
 LS 77.80 77.88 81.63 84.33 80.41
 SVM 78.56 78.66 83.03 84.74 81.25
 LR 79.73 80.14 84.54 86.10 82.63
 MLP 81.70 81.65 85.45 85.95 83.69
Shared Model Only
 LS 78.40 79.76 84.67 85.73 82.14
 SVM 79.16 80.97 85.15 86.06 82.83
 LR 80.05 81.88 85.19 86.56 83.42
 MLP 82.40 82.15 85.90 88.20 84.66
 MAN-L2-MLP 82.05 83.45 86.45 88.85 85.20
 MAN-NLL-MLP 81.85 83.10 85.75 89.10 84.95
Shared-Private Models
 RMTL¹ 81.33 82.18 85.49 87.02 84.01
 MTLGraph² 79.66 81.84 83.69 87.06 83.06
 CMSC-LS³ 82.10 82.40 86.12 87.56 84.55
 CMSC-SVM³ 82.26 83.48 86.76 88.20 85.18
 CMSC-LR³ 81.81 83.73 86.67 88.23 85.11
 SP-MLP 82.00 84.05 86.85 87.30 85.05
 MAN-L2-SP-MLP 82.46 83.98 87.22* 88.53 85.55*
 MAN-NLL-SP-MLP 82.98* 84.03 87.06 88.57* 85.66*

¹Evgeniou and Pontil (2004)  ²Zhou et al. (2012)  ³Wu and Huang (2015)

Table 1: MDTC results on the Amazon dataset. Models in bold are ours, while the performance of the rest is taken from Wu and Huang (2015). Bold numbers indicate the highest performance in each domain, and * shows statistical significance over CMSC under a one-sample t-test, based on 5 runs.

Table 1 shows the main results. Three types of models are shown: Domain-Specific Models Only, where only in-domain models are trained (for our models, this means $\mathcal{F}_s$ is disabled; similarly, for Shared Model Only, no $\mathcal{F}_d$ is used); Shared Model Only, where a single model is trained with all data; and Shared-Private Models, a combination of the previous two. Within each category, various architectures are examined, such as Least Squares (LS), SVM, and Logistic Regression (LR). As explained before, we use MLPs as the feature extractors for all our models (in bold). Among our models, the ones with the MAN prefix use adversarial training, and MAN-L2 and MAN-NLL indicate the L2-loss and NLL-loss variants of MAN, respectively.

From Table 1, we can see that by adopting modern deep neural networks, our methods achieve superior performance within the first two model categories even without adversarial training. This is corroborated by the fact that our SP-MLP model performs comparably to CMSC, while the latter relies on external resources such as sentiment lexica. Moreover, when our multinomial adversarial nets are introduced, further improvement is observed. With both loss functions, MAN outperforms all Shared-Private baseline systems on each domain and achieves statistically significantly higher overall performance. For our MAN-SP models, we report the mean accuracy as well as the standard errors over five runs, to illustrate the performance variance and to conduct significance tests. MAN's performance is relatively stable and consistently outperforms CMSC.

4.2 Experiments for Unlabeled Domains

As CMSC requires labeled data for each domain, their experiments were naturally designed this way. In reality, however, many domains may not have any annotated corpora available, so it is also important to examine the performance of an MDTC system on these unlabeled domains. Fortunately, as described before, MAN's adversarial training only utilizes unlabeled data from each domain to learn the domain-invariant features, and can thus be used on unlabeled domains as well. During testing, only the shared feature vector is fed into $\mathcal{C}$, while the domain feature vector is set to $\mathbf{0}$.

 Target Domain Book DVD Elec. Kit. Avg.
 MLP 76.55 75.88 84.60 85.45 80.46
 mSDA¹ 76.98 78.61 81.98 84.26 80.46
 DANN² 77.89 78.86 84.91 86.39 82.01
 MDAN (H-MAX)³ 78.45 77.97 84.83 85.80 81.76
 MDAN (S-MAX)³ 78.63 80.65 85.34 86.26 82.72
MAN-L2-SP-MLP 78.45 81.57 83.37 85.57 82.24
MAN-NLL-SP-MLP 77.78 82.74 83.75 86.41 82.67
¹Chen et al. (2012)  ²Ganin et al. (2016)  ³Zhao et al. (2017)

Table 2: Results on unlabeled domains. Models in bold are our models, while the rest are taken from Zhao et al. (2017). The highest performance in each domain is shown in bold.

In order to validate MAN's effectiveness, we compare it to state-of-the-art multi-source domain adaptation (MS-DA) methods (see §5). Compared to standard domain adaptation methods with one source and one target domain, MS-DA allows adaptation from multiple source domains to a single target domain. Analogously, MDTC can be viewed as multi-source, multi-target domain adaptation, which is advantageous when multiple target domains exist. With multiple target domains, MS-DA needs to treat each one as an independent task, which is more expensive and cannot utilize the unlabeled data in the other target domains.

In this work, we compare MAN with one recent MS-DA method, MDAN Zhao et al. (2017). Their experiments only have one target domain to suit their approach, and we follow this setting for fair comparison. However, it is worth noting that MAN is designed for the MDTC setting, and can deal with multiple target domains at the same time, which can potentially improve the performance by taking advantage of more unlabeled data from multiple target domains during adversarial training. We adopt the same setting as \newciteDBLP:journals/corr/ZhaoZWCMG17, which is based on the same multi-domain Amazon review dataset. Each of the four domains in the dataset is treated as the target domain in four separate experiments, while the remaining three are used as source domains.

books elec. dvd kitchen apparel camera health music toys video baby magaz. softw. sports IMDb MR Avg.
Domain-Specific Models Only
 BiLSTM 81.0 78.5 80.5 81.2 86.0 86.0 78.7 77.2 84.7 83.7 83.5 91.5 85.7 84.0 85.0 74.7 82.6
CNN 85.3 87.8 76.3 84.5 86.3 89.0 87.5 81.5 87.0 82.3 82.5 86.8 87.5 85.3 83.3 75.5 84.3
Shared Model Only
 FS-MTL 82.5 85.7 83.5 86.0 84.5 86.5 88.0 81.2 84.5 83.7 88.0 92.5 86.2 85.5 82.5 74.7 84.7
MAN-L2-CNN 88.3 88.3 87.8 88.5 85.3 90.5 90.8 85.3 89.5 89.0 89.5 91.3 88.3 89.5 88.5 73.8 87.7
MAN-NLL-CNN 88.0 87.8 87.3 88.5 86.3 90.8 89.8 84.8 89.3 89.3 87.8 91.8 90.0 90.3 87.3 73.5 87.6
Shared-Private Models
 ASP-MTL 84.0 86.8 85.5 86.2 87.0 89.2 88.2 82.5 88.0 84.5 88.2 92.2 87.2 85.7 85.5 76.7 86.1
MAN-L2-SP-CNN 87.6* 87.4 88.1* 89.8* 87.6 91.4* 89.8* 85.9* 90.0* 89.5* 90.0 92.5 90.4* 89.0* 86.6 76.1 88.2*
MAN-NLL-SP-CNN 86.8* 88.8 88.6* 89.9* 87.6 90.7 89.4 85.5* 90.4* 89.6* 90.2 92.9 90.9* 89.0* 87.0* 76.7 88.4*
Table 3: Results on the FDU-MTL dataset. Bolded models are ours, while the rest are from Liu et al. (2017). The highest performance in each domain is highlighted. For our full MAN models, statistical significance over ASP-MTL, based on 5 runs, is indicated by *.

In Table 2, the target domain is shown on top, and the test set accuracy is reported for various systems. It shows that MAN outperforms several baseline systems, such as a MLP trained on the source-domains, as well as single-source domain adaptation methods such as mSDA Chen et al. (2012) and DANN Ganin et al. (2016), where the training data in the multiple source domains are combined and viewed as a single domain. Finally, when compared to MDAN, MAN and MDAN each achieves higher accuracy on two out of the four target domains, and the average accuracy of MAN is similar to MDAN. Therefore, MAN achieves competitive performance for the domains without annotated corpus. Nevertheless, unlike MS-DA methods, MAN can handle multiple target domains at one time.

4.3 Experiments on the MTL Dataset

To make fair comparisons, the previous experiments follow the standard settings in the literature, where the widely adopted Amazon review dataset is used. However, this dataset has a few limitations. First, it has only four domains. In addition, the reviews are already tokenized and converted to a bag of features consisting of unigrams and bigrams; raw review texts are hence not available, making it impossible to use certain modern neural architectures such as CNNs and RNNs. To provide more insight into how well MAN works with other feature extractor architectures, we present a third set of experiments on the FDU-MTL dataset Liu et al. (2017). The dataset was created as a multi-task learning dataset with 16 tasks, where each task is essentially a different domain of reviews. It has 14 Amazon domains: books, electronics, DVD, kitchen, apparel, camera, health, music, toys, video, baby, magazine, software, and sports, in addition to two movie review domains from the IMDb and MR datasets. Each domain has a development set of 200 samples and a test set of 400 samples. The amounts of training and unlabeled data vary across domains but are roughly 1400 and 2000, respectively.

We compare MAN with ASP-MTL Liu et al. (2017) on the FDU-MTL dataset. ASP-MTL also adopts adversarial training for learning a shared feature space and can be viewed as a special case of MAN when adopting the NLL loss (MAN-NLL). Furthermore, while Liu et al. (2017) do not provide any theoretical justifications, we prove in §3 the validity of MAN for not only the NLL loss but also the additional L2 loss. Beyond this theoretical advantage, we show in this section that MAN also substantially outperforms ASP-MTL in practice through a better feature extractor choice.

In particular, Liu et al. (2017) choose LSTM as their feature extractor, yet we found CNN Kim (2014) to achieve much better accuracy while being an order of magnitude faster. Indeed, as shown in Table 3, with or without adversarial training, our CNN models outperform the LSTM ones by a large margin. When MAN is introduced, we attain state-of-the-art performance on every domain with an 88.4% overall accuracy, surpassing ASP-MTL by a significant margin of 2.3%.

We hypothesize that LSTM performs much worse than CNN due to the lack of an attention mechanism: in ASP-MTL, only the last hidden unit is taken as the extracted features. While LSTM is effective at representing the context of each token, it might not be powerful enough to directly encode an entire document Bahdanau et al. (2015). Therefore, various attention mechanisms have been introduced on top of the vanilla LSTM to select the words (and contexts) most relevant for making the prediction. In our preliminary experiments, we found that a bi-directional LSTM with dot-product attention Luong et al. (2015) yields better performance than the vanilla LSTM in ASP-MTL; a sketch of this variant appears below. However, it still does not outperform CNN and is much slower. We therefore conclude that, for text classification tasks, CNN is both effective and efficient at extracting local and higher-level features for making a single categorization.
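The following is a minimal PyTorch sketch of the BiLSTM-with-dot-product-attention extractor referred to above; the learned-query pooling shown here is our assumption of a standard formulation in the spirit of Luong et al. (2015), not the exact configuration we used.

    import torch
    import torch.nn as nn

    class AttnBiLSTMExtractor(nn.Module):
        """BiLSTM feature extractor that pools hidden states via dot-product
        attention instead of taking only the last hidden unit."""
        def __init__(self, emb_size=100, hidden_size=64):
            super().__init__()
            self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True,
                                bidirectional=True)
            self.query = nn.Parameter(torch.randn(2 * hidden_size))
        def forward(self, emb):                     # emb: (batch, seq_len, emb_size)
            h, _ = self.lstm(emb)                   # (batch, seq_len, 2*hidden)
            scores = h @ self.query                 # dot-product attention scores
            weights = torch.softmax(scores, dim=1)  # (batch, seq_len)
            return (h * weights.unsqueeze(-1)).sum(dim=1)  # weighted sum of states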

Finally, we observe that MAN-NLL achieves slightly higher overall performance than MAN-L2, providing evidence for the claim of a recent study Lucic et al. (2017) that the original GAN loss (NLL) may not be inherently inferior. Moreover, the two variants excel in different domains, suggesting the possibility of further gains from using an ensemble.

5 Related Work

Multi-Domain Text Classification

The MDTC task was first examined by Li and Zong (2008), who proposed to fuse the training data from multiple domains either at the feature level or at the classifier level. The prior art in MDTC Wu and Huang (2015) decomposes the text classifier into a general one and a set of domain-specific ones. However, the general classifier is learned only through parameter sharing, so domain-specific knowledge may sneak into it. Their method also requires external resources to help improve accuracy and to compute domain similarities.

Domain Adaptation

Domain adaptation attempts to transfer knowledge from a source domain to a target one; the traditional form is single-source, single-target (SS,ST) adaptation Blitzer et al. (2006). Another variant is SS,MT adaptation Yang and Eisenstein (2015), which tries to simultaneously transfer knowledge to multiple target domains from a single source; however, it cannot fully take advantage of training data that comes from multiple source domains. MS,ST adaptation Mansour et al. (2009); Zhao et al. (2017) can deal with multiple source domains but transfers to only a single target domain. When multiple target domains exist, it must therefore treat them as independent problems, which is more expensive and cannot utilize the additional unlabeled data in these domains. Finally, MDTC can be viewed as MS,MT adaptation, which is arguably more general and realistic.

Adversarial Networks

The idea of adversarial networks was proposed by Goodfellow et al. (2014) for image generation, and has been applied to various NLP tasks as well Chen et al. (2016); Li et al. (2017). Ganin et al. (2016) first used it for SS,ST domain adaptation, followed by many others. Bousmalis et al. (2016) utilized adversarial training in a shared-private model for domain adaptation to learn domain-invariant features, but still focused on the SS,ST setting. Finally, the idea of using adversarial nets to discriminate among multiple distributions was empirically explored in very recent work Liu et al. (2017) under the multi-task learning setting, and can be considered a special case of our MAN framework with the NLL domain loss. Nevertheless, we propose a more general framework with alternative architectures for the adversarial component, and for the first time provide theoretical justifications for multinomial adversarial nets. Moreover, Liu et al. (2017) used an LSTM without attention as their feature extractor, which we found to perform sub-optimally in our experiments. We instead choose Convolutional Neural Nets as our feature extractor, which achieve higher accuracy while running an order of magnitude faster (see §4.3).

6 Conclusion

In this work, we propose a family of Multinomial Adversarial Networks (MANs) that generalize the traditional binomial adversarial nets in the sense that a MAN can simultaneously minimize the difference among multiple probability distributions rather than just two. We provide theoretical justifications for two instances of MAN, MAN-NLL and MAN-L2, showing that they are minimizers of two different f-divergence metrics among multiple distributions, respectively. This indicates that MAN can be used to make multiple distributions indistinguishable from one another. It can hence be applied to a variety of tasks, similar to the versatile binomial adversarial nets, which have been used in many areas to make two distributions alike.

In this paper, we design a MAN model for the MDTC task, following the shared-private paradigm: a shared feature extractor learns domain-invariant features, while domain feature extractors learn domain-specific ones. MAN is used to enforce that the shared feature extractor learn only domain-invariant knowledge, by leveraging MAN's ability to make the shared feature distributions of samples from each domain indistinguishable. We conduct extensive experiments demonstrating that our MAN model outperforms the prior art systems in MDTC, and that it achieves state-of-the-art performance on domains without labeled data when compared to multi-source domain adaptation methods.

References

Appendix A Proofs

A.1 Proofs for MAN-NLL

Assume we have $N$ domains, and consider the distribution of the shared features $\mathbf{f} = \mathcal{F}_s(x)$ for instances in each domain $i$:

$$P_i(\mathbf{f}) \triangleq P(\mathbf{f} = \mathcal{F}_s(x) \mid x \in \text{domain } i)$$

The objective that $\mathcal{D}$ attempts to minimize is:

$$J_\mathcal{D} = -\sum_{i=1}^{N} \mathbb{E}_{\mathbf{f} \sim P_i} \left[ \log \mathcal{D}_i(\mathbf{f}) \right] \qquad (12)$$

where $\mathcal{D}_i(\mathbf{f})$ is the $i$-th dimension of $\mathcal{D}$'s output vector, which conceptually corresponds to the softmax probability of predicting that $\mathbf{f}$ is from domain $i$. We therefore have the property that for any $\mathbf{f}$:

$$\sum_{i=1}^{N} \mathcal{D}_i(\mathbf{f}) = 1 \qquad (13)$$
Lemma 2.

For any fixed $\mathcal{F}_s$, the optimum domain discriminator $\mathcal{D}^*$ is:

$$\mathcal{D}_i^*(\mathbf{f}) = \frac{P_i(\mathbf{f})}{\sum_{j=1}^{N} P_j(\mathbf{f})} \qquad (14)$$

Proof.

For a fixed $\mathcal{F}_s$, the optimum

$$\mathcal{D}^* = \arg\min_{\mathcal{D}} J_\mathcal{D} = \arg\max_{\mathcal{D}} \int_{\mathbf{f}} \sum_{i=1}^{N} P_i(\mathbf{f}) \log \mathcal{D}_i(\mathbf{f}) \, d\mathbf{f}$$

We employ the Lagrangian multiplier to derive the maximizer of the integrand under the constraint (13). Let

$$L(\mathcal{D}_1, \ldots, \mathcal{D}_N, \lambda) = \sum_{i=1}^{N} P_i \log \mathcal{D}_i - \lambda \left( \sum_{i=1}^{N} \mathcal{D}_i - 1 \right)$$

Setting $\nabla L = 0$:

$$\frac{\partial L}{\partial \mathcal{D}_i} = \frac{P_i}{\mathcal{D}_i} - \lambda = 0 \;\; (\forall i), \qquad \sum_{i=1}^{N} \mathcal{D}_i = 1$$

Solving the two equations, we have $\lambda = \sum_{j=1}^{N} P_j$ and:

$$\mathcal{D}_i^* = \frac{P_i}{\sum_{j=1}^{N} P_j} \qquad \blacksquare$$
On the other hand, the loss function of the shared feature extractor $\mathcal{F}_s$ consists of two additive components: the loss from the text classifier $\mathcal{C}$, and the loss from the domain discriminator $\mathcal{D}$:

$$J_{\mathcal{F}_s} = J_{\mathcal{F}_s}^C + \lambda J_{\mathcal{F}_s}^D = J_\mathcal{C} - \lambda J_\mathcal{D} \qquad (15)$$

We have the following theorem for the domain loss $J_{\mathcal{F}_s}^D$ of $\mathcal{F}_s$:

Theorem 3.

When $\mathcal{D}$ is trained to its optimality:

$$J_{\mathcal{F}_s}^D = -J_\mathcal{D} = -N \log N + N \cdot \mathrm{JSD}(P_1, \ldots, P_N) \qquad (16)$$

where $\mathrm{JSD}(\cdot)$ is the generalized Jensen-Shannon divergence Lin (1991) among multiple distributions.

Proof.

Let $\overline{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$.

There are two equivalent definitions of the generalized Jensen-Shannon divergence: the original definition based on Shannon entropy Lin (1991), and a reshaped one expressed as the average Kullback-Leibler divergence of each $P_i$ to the centroid $\overline{P}$ Aslam and Pavlu (2007). We adopt the latter here:

$$\mathrm{JSD}(P_1, \ldots, P_N) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}(P_i \,\|\, \overline{P}) \qquad (17)$$

Now substituting $\mathcal{D}^*$ (14) into $J_{\mathcal{F}_s}^D = -J_\mathcal{D}$:

$$J_{\mathcal{F}_s}^D = \sum_{i=1}^{N} \mathbb{E}_{\mathbf{f} \sim P_i} \left[ \log \frac{P_i(\mathbf{f})}{N \overline{P}(\mathbf{f})} \right] = \sum_{i=1}^{N} \left[ \mathrm{KL}(P_i \,\|\, \overline{P}) - \log N \right] = -N \log N + N \cdot \mathrm{JSD}(P_1, \ldots, P_N) \qquad \blacksquare$$
Consequently, by the non-negativity of the generalized JSD Lin (1991), we have the following corollary:

Corollary 2.

The optimum of $J_{\mathcal{F}_s}^D$ is $-N \log N$, and it is achieved if and only if $P_1 = P_2 = \cdots = P_N = \overline{P}$.

A.2 Proofs for MAN-L2

The proof is similar for MAN with the L2 loss. The loss function used by $\mathcal{D}$ is, for a sample from domain $d$ with shared feature vector $\mathbf{f}$:

$$L_D^{L2}(\mathcal{D}(\mathbf{f}), d) = \sum_{i=1}^{N} \left( \mathcal{D}_i(\mathbf{f}) - \mathbb{1}_{\{i=d\}} \right)^2 \qquad (18)$$

So the objective that $\mathcal{D}$ minimizes is:

$$J_\mathcal{D} = \sum_{i=1}^{N} \mathbb{E}_{\mathbf{f} \sim P_i} \left[ \sum_{j=1}^{N} \left( \mathcal{D}_j(\mathbf{f}) - \mathbb{1}_{\{i=j\}} \right)^2 \right] \qquad (19)$$

For simplicity, we further constrain $\mathcal{D}$'s outputs to lie on a simplex:

$$\forall \mathbf{f}: \;\; \sum_{i=1}^{N} \mathcal{D}_i(\mathbf{f}) = 1 \qquad (20)$$
Lemma 3.

For any fixed $\mathcal{F}_s$, the optimum domain discriminator $\mathcal{D}^*$ is:

$$\mathcal{D}_i^*(\mathbf{f}) = \frac{P_i(\mathbf{f})}{\sum_{j=1}^{N} P_j(\mathbf{f})} \qquad (21)$$

Proof.

For a fixed $\mathcal{F}_s$, the optimum

$$\mathcal{D}^* = \arg\min_{\mathcal{D}} J_\mathcal{D} = \arg\min_{\mathcal{D}} \int_{\mathbf{f}} \sum_{i=1}^{N} P_i(\mathbf{f}) \sum_{j=1}^{N} \left( \mathcal{D}_j(\mathbf{f}) - \mathbb{1}_{\{i=j\}} \right)^2 d\mathbf{f}$$

Similar to MAN-NLL, we employ the Lagrangian multiplier to derive the minimizer of the integrand under the constraint (20). Setting the gradient of the Lagrangian to zero:

$$2 \mathcal{D}_j \sum_{i=1}^{N} P_i - 2 P_j + \lambda = 0 \;\; (\forall j), \qquad \sum_{j=1}^{N} \mathcal{D}_j = 1$$

Solving the two equations, we have $\lambda = 0$ and:

$$\mathcal{D}_i^* = \frac{P_i}{\sum_{j=1}^{N} P_j} \qquad \blacksquare$$
For the domain loss of $\mathcal{F}_s$:

Theorem 4.

Let $\overline{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$. When $\mathcal{D}$ is trained to its optimality:

$$J_{\mathcal{F}_s}^D = \frac{1}{N} \sum_{i=1}^{N} \chi^2_N(P_i \,\|\, \overline{P}) \qquad (22)$$

where $\chi^2_N(\cdot \,\|\, \cdot)$ is the Neyman $\chi^2$ divergence Nielsen and Nock (2014).

Proof.

Substituting $\mathcal{D}^*$ (21) into $J_{\mathcal{F}_s}^D$ (7):

$$J_{\mathcal{F}_s}^D = \sum_{i=1}^{N} \mathbb{E}_{\mathbf{f} \sim P_i} \left[ \sum_{j=1}^{N} \left( \frac{P_j(\mathbf{f})}{N \overline{P}(\mathbf{f})} - \frac{1}{N} \right)^2 \right] = \int_{\mathbf{f}} N \overline{P}(\mathbf{f}) \cdot \frac{1}{N^2} \sum_{j=1}^{N} \frac{\left( P_j(\mathbf{f}) - \overline{P}(\mathbf{f}) \right)^2}{\overline{P}(\mathbf{f})^2} \, d\mathbf{f} = \frac{1}{N} \sum_{j=1}^{N} \int_{\mathbf{f}} \frac{\left( P_j(\mathbf{f}) - \overline{P}(\mathbf{f}) \right)^2}{\overline{P}(\mathbf{f})} \, d\mathbf{f} = \frac{1}{N} \sum_{j=1}^{N} \chi^2_N(P_j \,\|\, \overline{P}) \qquad \blacksquare$$
Finally, by the non-negativity and joint convexity of the f-divergence, we have the following corollary:

Corollary 3.

The optimum of $J_{\mathcal{F}_s}^D$ is $0$, and the equality is attained if and only if $P_1 = P_2 = \cdots = P_N = \overline{P}$.

Appendix B Implementation Details

For all three of our experiments, we use the same hyperparameter values for $\lambda$ and $k$ (see Algorithm 1). For both optimizers, Adam Kingma and Ba (2015) is used. The size of the shared feature vector is set to 128, while that of the domain feature vector is 64. Dropout is used in all components. $\mathcal{C}$ and $\mathcal{D}$ each have one hidden layer of the same size as their input (192 for $\mathcal{C}$ and 128 for $\mathcal{D}$). ReLU is used as the activation function. Batch normalization Ioffe and Szegedy (2015) is used in both $\mathcal{C}$ and $\mathcal{D}$, but not in the feature extractors. We use a batch size of 8.

For our first two experiments on the Amazon review dataset, the MLP feature extractor is used. As described in the paper, it has an input size of 5000 and two hidden layers.

For the CNN feature extractor used in the FDU-MTL experiment, a single convolution layer is used. The kernel sizes are 3, 4, and 5, and the number of kernels is 200. The convolution layers take as input the 100d word embeddings of each word in the input sequence. We use word2vec word embeddings Mikolov et al. (2013) trained on unlabeled raw Amazon reviews Blitzer et al. (2007). After convolution, the outputs pass through a ReLU layer before being fed into a max pooling layer. The pooled output is then fed into a single fully connected layer to be converted into a feature vector of size either 128 (shared) or 64 (domain). More details on using CNNs for text classification can be found in the original paper Kim (2014). MAN is implemented using PyTorch Paszke et al. (2017).
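The following PyTorch sketch mirrors the CNN extractor description above (kernel sizes 3/4/5, 200 kernels each, 100d embeddings, max-over-time pooling, fully connected projection); details such as padding are our own assumptions rather than the exact implementation.

    import torch
    import torch.nn as nn

    class CNNFeatureExtractor(nn.Module):
        """Kim (2014)-style CNN: parallel convolutions over word embeddings,
        ReLU, max-over-time pooling, then a fully connected projection."""
        def __init__(self, emb_size=100, num_kernels=200,
                     kernel_sizes=(3, 4, 5), out_size=128):
            super().__init__()
            self.convs = nn.ModuleList(
                nn.Conv1d(emb_size, num_kernels, k) for k in kernel_sizes)
            self.fc = nn.Linear(num_kernels * len(kernel_sizes), out_size)
        def forward(self, emb):                     # emb: (batch, seq_len, emb_size)
            x = emb.transpose(1, 2)                 # Conv1d expects (batch, C, L)
            pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
            return self.fc(torch.cat(pooled, dim=1))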
