Pre-train, Interact, Fine-tune: A Novel Interaction Representation for Text Classification
Text representation can aid machines in understanding text. Previous work on text representation often focuses on the so-called forward implication, i.e., preceding words are taken as the context of later words for creating representations, thus ignoring the fact that the semantics of a text segment is a product of the mutual implication of words in the text: later words contribute to the meaning of preceding words. We introduce the concept of interaction and propose a two-perspective interaction representation that encapsulates a local and a global interaction representation. Here, a local interaction representation is one that models interactions among words with parent-children relationships on the syntactic trees, and a global interaction representation is one that models interactions among all the words in a sentence. We combine the two interaction representations to develop a Hybrid Interaction Representation (HIR).
Inspired by existing feature-based and fine-tuning-based pretrain-finetuning approaches to language models, we integrate the advantages of feature-based and fine-tuning-based methods to propose the Pre-train, Interact, Fine-tune (PIF) architecture.
We evaluate our proposed models on five widely-used datasets for text classification tasks. Our ensemble method, HIR, outperforms state-of-the-art baselines with improvements ranging from 2.03% to 3.15% in terms of error rate. In addition, we find that the improvements of PIF over most state-of-the-art methods are not affected by increasing the length of the text.
Keywords: Interaction representation, Pre-training, Fine-tuning, Classification
Text representations map text spans into real-valued vectors or matrices. They have come to play a crucial role in machine understanding of text. Applications include sentiment classification (Tang2015Document), question answering (Qin2017Enhancing), summarization (Ren2017Leveraging), and sentence inference (Parikh2016A).
Previous work on text representation can be categorized into three main types (Xie2016I), i.e., statistics-based, neural-network-based and pretraining-based embeddings. Statistics-based embedding models are estimated based on a statistical indicator, e.g., the frequency of co-occurring words (in bag-of-words models (Joachims1998Text)), the frequency of co-occurring word pairs (in n-gram models (Zhang2015Character)), and the weights of words in different documents (the TF-IDF model (Robertson2004Understanding)). Neural-network-based embedding models mainly rely on a neural network architecture to learn a text representation, based on a hidden layer (Joulin2016Bag), convolutional neural networks (Kim2014Convolutional) or recurrent neural networks (Liu2016Recurrent). Additionally, methods of this type may also consider the syntactic structure to reflect the semantics of text, e.g., recursive neural networks (Socher2013EMNLP) and tree-structured long short-term memory networks (Tree-LSTM) (Tai2015improved). Pretraining-based embedding models adopt a feature-based (Mikolov2013Efficient; Pennington2014G; McCann2017L; Peters2018D) or fine-tuning strategy (Dai2015S; Howard2018U; Devlin2019B; Yang2019X) to capture semantic and syntactic information from large text corpora.
In general, the aforementioned models work well for the task of text classification (Joulin2016Bag; Kim2014Convolutional; Zhang2015Character; Howard2018U). However, in existing embedding models, the process of generating the vectorized representation of a text usually follows a so-called one-way action. That is to say, representations generated for the preceding text are taken as the context to determine the representations of later text. Although a bidirectional LSTM considers bidirectional actions, it simply concatenates two one-way actions to get the embeddings. We argue that the semantics as defined in terms of a text representation should be a product of interactions of all source elements (e.g., words or sentences) in the text. Restrictions to one-way actions may result in a partial semantic loss (Saif2016C), causing poor performance in downstream applications. We hypothesize that although these interaction relations may be learned by neural networks with enough samples, explicitly modeling such interaction relations can directly make text representation more informative and effective. Furthermore, recent unsupervised representation learning has proven to be effective and promising in the field of natural language processing (McCann2017L; Peters2018D; Howard2018U; Devlin2019B; Yang2019X). So far, these approaches are limited to a single strategy (either the feature-based or the fine-tuning strategy), which results in a so-called fine-tune error: training may become trapped in a local optimum.
Thus, as illustrated in Figure 1, we focus on the task of text classification and propose a novel pipeline with the following ingredients:
pre-train language model on a large text corpus to get the related word embeddings and neural networks parameters;
interact the word embeddings based on the pre-trained parameters to obtain the interaction representation; and
fine-tune the classifier with the interaction representation and pre-trained word embeddings as input.
More specifically, in the interaction representation layer, we propose a two-perspective interaction representation using a Local Interaction Representation (LIR) and a Global Interaction Representation (GIR). LIR applies an attention mechanism (Bahdanau2015Neural) inside the syntactic structure of a sentence, e.g., dependency-based or constituency-based parse trees, to reflect the local interaction of adjacent words. GIR employs an attention mechanism with an enumeration-based strategy to represent the interactions of all words in a sentence. After that, we combine LIR and GIR into a Hybrid Interaction Representation (HIR) model to represent both local and global interactions of words in a sentence. For the pretrain-finetuning process, we combine the feature-based and the fine-tuning strategies and propose a hybrid language model pretrain-finetuning (HLMPf) approach. HLMPf first follows the fine-tuning strategy and uses the pre-trained embeddings and neural network parameters to initialize the interaction representation layer. Then, according to the feature-based strategy, HLMPf applies the pre-trained embeddings as additional features and concatenates them with the interaction representation in the classifier fine-tuning layer.
For evaluation, we conduct a comprehensive experiment on five publicly available benchmark datasets for the task of text classification. The experimental results show that our proposal with interaction representations and the hybrid pretrain-finetuning strategy outperforms the state-of-the-art baselines for text classification, with improvements ranging from 2.03% to 3.15% in terms of accuracy.
The main contributions of our work are as follows:
We propose a novel pipeline for the task of text classification, i.e., Pre-train, Interact, Fine-tune (PIF).
To the best of our knowledge, ours is the first attempt to model word interactions for text representation. We introduce a two-perspective interaction representation for text classification, i.e., a Local Interaction Representation (LIR) and a Global Interaction Representation (GIR), which are then combined to generate a Hybrid Interaction Representation (HIR) model.
We combine the advantages of two popular language model pretrain-fine-tuning strategies (feature-based and fine-tuning) and propose the hybrid language model pretrain-finetuning (HLMPf).
We analyze the effectiveness of our proposal and find that it outperforms the state-of-the-art methods for text classification in terms of accuracy.
2 Related Work
In this section, we briefly summarize the general statistical approaches for text representation in Section 2.1 and the neural-networks-based methods in Section 2.2. We then describe the recent work on language model pre-training for downstream applications in Section 2.3.
2.1 Statistics-based representation
As a word is the most basic unit of semantics, the traditional one-hot representation model converts a word in a vocabulary into a sparse vector with a single high value (i.e., 1) in its position and the others with a low value (i.e., 0). The representation is employed in the Bag-of-Words (BoW) model (Joachims1998Text) to reflect the word frequency. However, the BoW model only symbolizes the word and cannot reflect the semantic relationship between words. Consequently, the bag-of-means model (Zhang2015Character) was proposed to cluster the word embeddings learned by the word2vec model (Mikolov2013Efficient). Furthermore, the bag-of-n-grams (Zhang2015Character) was developed to take the n-grams (up to 5-grams) as the vocabulary in the BoW model. In addition, with some extra statistical information, e.g., TF-IDF, a better document representation can be produced (Robertson2004Understanding). Other text features, e.g., the noun phrases (Lewis1992An) and the tree kernels (Post2013Explicit), were incorporated into the model construction.
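To make the statistics-based representations above concrete, the following minimal sketch builds bag-of-words count vectors and TF-IDF weights for a toy corpus. The corpus, the function names, and the particular IDF variant (unsmoothed log(N/df)) are illustrative assumptions and are not taken from the cited work.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight raw counts by log(N / df), one common unsmoothed IDF variant."""
    vocab, vectors = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each vocabulary word.
    df = [sum(1 for vec in vectors if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weighted = [[tf * idf[j] for j, tf in enumerate(vec)] for vec in vectors]
    return vocab, weighted


# Hypothetical toy corpus, already tokenized.
toy_docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocab, counts = bag_of_words(toy_docs)
_, weights = tf_idf(toy_docs)
```

Each vector has one dimension per vocabulary entry, so the representation grows with the vocabulary and is mostly zeros for short documents.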
Clearly, progress has been made in statistics-based representation (Bernauer2018T). However, such traditional statistical representation approaches inevitably face the problems of data sparsity and high dimensionality, which prevents their application to large-scale corpora. In addition, such approaches are built on shallow statistics, and the deeper semantic information of the text is not well captured.
Instead, our proposal in this paper based on neural networks has the ability to learn a low-dimensional and distributed representation to overcome such problems.
2.2 Neural-based representation
Since Bengio2013A first employed the neural network architecture to train a language model, considerable attention has been devoted to proposing neural network-related models for text representation. For instance, the FastText model (Joulin2016Bag) employs one hidden layer to integrate the subword information and obtains satisfactory results. However, this model simply averages all word embeddings and discards the word order. In view of that, Liu2016Recurrent employed the recurrent structure, i.e., RNNs, to consider the word order and to jointly learn text representation across multiple related tasks. Compared to RNNs, CNNs are easier to train and capture the local word-pair information (Kim2014Convolutional; Zhang2015Character).
Furthermore, combinations of neural network models have been developed to exploit the advantages of each single network. For example, Lai2015Recurrent proposed recurrent convolutional neural networks (RCNN), which adopt the recurrent structure to grasp the context information and employ a max-pooling layer to identify the key components in text. Besides, other document features have been injected into document modeling. For instance, Zheng2019C took the hierarchical structure of text into account. He2018E transferred document-level knowledge to improve the performance of aspect-level sentiment classification.
Although these approaches have proved effective in downstream applications, they depend completely on the network structure to implicitly represent a document, ignoring the interactions that exist among the source elements in a document, e.g., words or sentences. In contrast, our proposal takes the modeling of interactions as the starting point to better reflect the semantic relationships between words in a sentence, which we argue can help improve the performance of downstream tasks, e.g., sentiment classification.
2.3 Language model pre-training-based representation
Language model pre-training has been shown to be effective for natural language processing tasks, e.g., question answering (McCann2017L), textual entailment (Peters2018D), semantic role labeling (Devlin2019B), and sentiment analysis (Dai2015S). These pre-training models can be mainly classified into two classes, i.e., feature-based models and fine-tuning models.
The feature-based models generate pre-trained embeddings from other tasks, where the output can be regarded as additional features for the current task architecture. For instance, word2vec (Mikolov2013Efficient) and GloVe (Pennington2014G) focus on transforming words into distributed representations and capturing syntax as well as semantics by pre-training neural language models on large text corpora. In addition, McCann2017L concentrated on the machine translation task to obtain contextualized word vectors (CoVe). Since these word-level models suffer from word polysemy, Peters2018D developed a sequence-level model, i.e., ELMo, to capture complex word features across different linguistic contexts and then used ELMo to generate context-sensitive word embeddings.
Different from the feature-based strategy (Mehta2018E), the fine-tuning models first produce contextual word representations that have been pre-trained on unlabeled text and are then fine-tuned for a supervised downstream task. For instance, Dai2015S trained a sequence auto-encoder model on unlabeled text as an initialization of another supervised network. However, this method suffers from overfitting and requires some in-domain knowledge to improve the performance. Consequently, Universal Language Model Fine-tuning (ULMFiT) (Howard2018U) was developed, which leverages general-domain pre-training and novel fine-tuning techniques to prevent overfitting. In addition, Devlin2019B proposed two unsupervised tasks, i.e., masked language modeling and next sentence prediction, to further improve the fine-tuning process. Finally, XLNet (Yang2019X) was proposed to employ a permutation language model to capture the bidirectional context and avoid the pretrain-finetune discrepancy.
Although representations based on language model pre-training have proved promising in NLP tasks, these methods are limited to either the feature-based or the fine-tuning strategy. Our proposal combines their respective characteristics to improve the performance of downstream applications.
3 Proposed Models
In this section, we first formally describe how to compute the interaction representation in Section 3.1, which can be divided into three parts, i.e., LIR (see Section 3.1.1), GIR (see Section 3.1.2) and HIR (see Section 3.1.3). We then introduce the HLMPf approach in detail (see Section 3.2), which combines the feature-based and fine-tuning strategies.
3.1 Interaction representation
We describe the Local Interaction Representation (LIR) of adjacent words and introduce the Global Interaction Representation (GIR) of all words in a sentence. After that, a Hybrid Interaction Representation (HIR) model is proposed.
3.1.1 Local interaction representation
We introduce an attentive tree LSTM that computes a local representation of words. The idea of an action of a word on another word is that the former assigns a semantic weight to the latter.
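The experiments we conduct related to LIR are based on constituency-based trees, but we explain the core concepts for both dependency-based and constituency-based trees. Given a dependency-based parse tree, let $C(p)$ denote the set of child words of a parent word $x_p$. To define the attentive tree LSTM, we introduce hidden states $h_k$ and memory cells $c_k$, $k \in \{1, 2, \ldots, |C(p)|\}$, for every child word. As shown in Fig. 2, unlike the Tree-LSTM model in (Tai2015improved), which only performs the one-way action (child words $\rightarrow$ parent word), LIR also considers an action in the opposite direction, i.e., parent word $\rightarrow$ child words.
Let us explain this in detail. In the action parent word $\rightarrow$ child words, we regard the parent word $x_p$ as a controller that assigns semantic weights, based on the attention mechanism, to its child words in a sentence (Saraiva2016A). Thus, we first convert the parent word $x_p$ into a hidden representation $\bar{h}_p$ as follows:
$$\bar{h}_p = \tanh\left(W^{(h)} x_p + b^{(h)}\right), \tag{1}$$
where $x_p$ is the pre-trained word embedding for the parent word; $W^{(h)}$ and $b^{(h)}$ are the weight matrix and the bias term, respectively. Then, we employ a general content-based function (Luong2015Effective) to connect the parent word and the child words as follows:
$$\alpha_k = \bar{h}_p^{\top} W_{\alpha} h_k, \tag{2}$$
where $\alpha_k$ is the connective representation of $\bar{h}_p$ and the hidden state $h_k$, and $W_{\alpha}$ is the connective matrix to be learned. After that, we apply a softmax function on the sequence of connective representations $\{\alpha_1, \alpha_2, \ldots, \alpha_{|C(p)|}\}$ to get the weight $\lambda_k$ as follows:
$$\lambda_k = \frac{\exp(\alpha_k)}{\sum_{i=1}^{|C(p)|} \exp(\alpha_i)}. \tag{3}$$
Finally, we represent the hidden interaction state $\tilde{h}_p$ that relates to all child states of the parent word $x_p$, i.e.,
$$\tilde{h}_p = \sum_{k=1}^{|C(p)|} \lambda_k h_k. \tag{4}$$
In the action child words $\rightarrow$ parent word in Fig. 2, we use the hidden interaction state $\tilde{h}_p$ and the parent word $x_p$ as input to the LSTM cell and obtain
$$i_p = \sigma\left(U^{(i)} x_p + W^{(i)} \tilde{h}_p + b^{(i)}\right), \tag{5}$$
$$o_p = \sigma\left(U^{(o)} x_p + W^{(o)} \tilde{h}_p + b^{(o)}\right), \tag{6}$$
$$f_{kp} = \sigma\left(U^{(f)} x_p + W^{(f)} h_k + b^{(f)}\right), \tag{7}$$
$$u_p = \tanh\left(U^{(u)} x_p + W^{(u)} \tilde{h}_p + b^{(u)}\right), \tag{8}$$
where $i_p$, $o_p$ and $f_{kp}$ are the input gate, the output gate and the forget gate, respectively; $u_p$ is the candidate hidden state of $x_p$. For $i_p$, $o_p$, $f_{kp}$ and $u_p$, we have a corresponding weight matrix of $x_p$ (i.e., $U^{(i)}$, $U^{(o)}$, $U^{(f)}$ and $U^{(u)}$), a weight matrix of $\tilde{h}_p$ (or $h_k$) (i.e., $W^{(i)}$, $W^{(o)}$, $W^{(f)}$ and $W^{(u)}$), and a bias term (i.e., $b^{(i)}$, $b^{(o)}$, $b^{(f)}$ and $b^{(u)}$). Finally, we can get the memory cell $c_p$ and the hidden state $h_p$ of the parent word $x_p$ as follows:
$$c_p = i_p \odot u_p + \sum_{k=1}^{|C(p)|} f_{kp} \odot c_k, \tag{9}$$
$$h_p = o_p \odot \tanh(c_p), \tag{10}$$
where $\odot$ is element-wise multiplication and $c_k$ is the memory cell of a child word.
Similarly, given a constituency-based tree, let $x_l$ and $x_r$ denote the left child word and the right child word of a parent word $x_p$. Since the parent word is a non-terminal node (i.e., $x_p$ is a zero vector), we use the hidden states $h_l$ and $h_r$ as the controllers instead of $\bar{h}_p$, respectively. Therefore, following Eq. (2)–(4), we obtain the hidden interaction states $\tilde{h}_l$ and $\tilde{h}_r$ related to $x_l$ and $x_r$, respectively. We concatenate $\tilde{h}_l$ and $\tilde{h}_r$ to represent the hidden interaction state of the parent word, i.e., $\tilde{h}_p = [\tilde{h}_l; \tilde{h}_r]$. Again, following Eq. (5)–(10), we can get the memory cell $c_p$ and the hidden state $h_p$ for parent word $x_p$.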
At this stage we have represented the local interaction, and each word has been updated by the interaction representation.
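The parent-to-children step described in this subsection can be sketched in a few lines of plain Python: the parent embedding is projected to a hidden vector, scored against each child's hidden state with a bilinear ("general") function, normalized with a softmax, and used to average the child states. This is only a sketch under assumptions: the function and parameter names (`local_interaction`, `W_h`, `b_h`, `W_alpha`) are hypothetical, the tanh projection is assumed, and the toy inputs are invented for illustration.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_interaction(x_p, child_states, W_h, b_h, W_alpha):
    """Return (attention weights, hidden interaction state) for one parent."""
    # Project the parent embedding to a controller vector (tanh is an assumption).
    h_ctrl = [math.tanh(z + b) for z, b in zip(matvec(W_h, x_p), b_h)]
    # Bilinear score between the controller and each child hidden state.
    scores = [dot(h_ctrl, matvec(W_alpha, h_k)) for h_k in child_states]
    weights = softmax(scores)
    dim = len(child_states[0])
    # Attention-weighted sum of the child hidden states.
    h_tilde = [sum(w * h_k[d] for w, h_k in zip(weights, child_states))
               for d in range(dim)]
    return weights, h_tilde


# Hypothetical toy example: a 2-d parent embedding and two child states.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, h_tilde = local_interaction(
    x_p=[1.0, 0.0],
    child_states=[[1.0, 0.0], [0.0, 1.0]],
    W_h=identity, b_h=[0.0, 0.0], W_alpha=identity,
)
```

In this toy setting the first child is aligned with the parent, so it receives the larger weight. The resulting `h_tilde` would then be fed, together with the parent embedding, into the LSTM cell for the child-to-parent action.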
3.1.2 Global interaction representation
Unlike LIR, which captures the syntactic relation between words, GIR adopts an enumeration-based strategy to employ an attention mechanism on all words in a sentence.
In detail, after applying Tree-LSTM to all $n$ words in a sentence, we have the hidden representations $\{h_1, h_2, \ldots, h_n\}$ corresponding to the words $\{x_1, x_2, \ldots, x_n\}$. In order to represent the interaction between a word $x_g$ and the other words in a sentence, we regard the word $x_g$ as a controller that can assign semantic weights to the other words in $\{x_1, x_2, \ldots, x_n\}$, excluding $x_g$ itself. Similarly, we employ a general content-based function to connect the word $x_g$ with the other words as follows:
$$\alpha_{gk} = h_g^{\top} W_{\alpha} h_k, \tag{11}$$
where $\alpha_{gk}$ is the connective representation of $h_g$ and $h_k$ ($k \neq g$), and $W_{\alpha}$ is the connective matrix. After that, we can get all connective representations between the word $x_g$ and the other words. Then, we apply a softmax function on the connective representation sequence to calculate the weight as follows:
$$\lambda_{gk} = \frac{\exp(\alpha_{gk})}{\sum_{i \neq g} \exp(\alpha_{gi})}, \tag{12}$$
where $\lambda_{gk}$ is the weight of word $x_k$ in the sentence that interacts with word $x_g$. Finally, we obtain the interaction representation $r_g$ as follows:
$$r_g = \sum_{k \neq g} \lambda_{gk} h_k. \tag{13}$$
By doing so, we enumerate all words in a sentence and return a sequence of interaction representations $\{r_1, r_2, \ldots, r_n\}$. We then adopt max-pooling on this sequence to produce the sentence embedding $s$ by
$$s = \max\{r_1, r_2, \ldots, r_n\}. \tag{14}$$
This completes the definition of the global interaction representation. We can train the sentence representation to update the pre-trained embeddings.
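The enumeration-based strategy can be illustrated with a short plain-Python sketch: each word in turn acts as the controller, attends over all other words with a bilinear score and a softmax, and the resulting per-word interaction vectors are max-pooled element-wise into a sentence embedding. The function name `global_interaction`, the parameter `W_alpha`, and the toy hidden states are hypothetical, and the sketch assumes at least two words per sentence.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_interaction(hidden, W_alpha):
    """Return (per-word interaction vectors, max-pooled sentence embedding)."""
    n, dim = len(hidden), len(hidden[0])
    reps = []
    for g in range(n):
        # The controller word attends to every word except itself.
        others = [hidden[k] for k in range(n) if k != g]
        scores = [dot(hidden[g], matvec(W_alpha, h_k)) for h_k in others]
        weights = softmax(scores)
        reps.append([sum(w * h_k[d] for w, h_k in zip(weights, others))
                     for d in range(dim)])
    # Element-wise max-pooling over the sequence of interaction vectors.
    sentence = [max(r[d] for r in reps) for d in range(dim)]
    return reps, sentence


# Hypothetical toy example: three words with 2-d hidden states.
identity = [[1.0, 0.0], [0.0, 1.0]]
reps, sentence = global_interaction(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], identity
)
```

Because every word attends to every other word, the cost of this step grows quadratically with the sentence length.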
3.1.3 Hybrid interaction representation
In order to capture both local and global interactions between words, we combine LIR and GIR to form a hybrid interaction representation model (HIR) for text representation. HIR first follows the procedure of LIR to produce the hidden state representations $\{h_1, h_2, \ldots, h_n\}$ for the corresponding words $\{x_1, x_2, \ldots, x_n\}$. Then, HIR applies the process of GIR to these hidden state representations to get the final sentence embedding $s$.
Eventually, in the process of class prediction, we apply a softmax classifier on the sentence embedding $s$ to get a predicted label $\hat{y}_s$, where $\hat{y}_s \in Y$ and $Y$ is the class label set, i.e.,
$$p(y \mid s) = \mathrm{softmax}\left(W^{(s)} s + b^{(s)}\right), \tag{15}$$
$$\hat{y}_s = \arg\max_{y \in Y} p(y \mid s). \tag{16}$$
Here, $W^{(s)}$ and $b^{(s)}$ are the reshape matrix and the bias term, respectively. For the loss function of HIR, we combine the corresponding losses of LIR and GIR into a single objective with two weighted terms:
the former term comes from LIR and the latter from GIR, and $\beta$ is the trade-off parameter between them. In this objective, $h_i$ is the hidden state and $\tilde{y}_i$ is the true class label of word $x_i$ in LIR; $\tilde{y}_s$ is the true class label of the sentence embedding $s$ in GIR. In addition, $W^{(s)}$ and $b^{(s)}$ can be trained using the dataset.
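The text describes the HIR objective as a LIR term followed by a GIR term, balanced by a trade-off parameter. One form consistent with that description is sketched here. It assumes negative log-likelihood (cross-entropy) losses, writes the trade-off parameter as $\beta$, and assumes that $\beta$ multiplies the GIR term while $(1-\beta)$ multiplies the LIR term. The text does not state which term carries the parameter, so the weighting should be read as hypothetical.

```latex
\mathcal{L}
  = -(1-\beta) \sum_{i=1}^{n} \log p\left(\tilde{y}_i \mid h_i\right)
    \;-\; \beta \, \log p\left(\tilde{y}_s \mid s\right)
```

Under this assumed form, $\beta = 0$ would train on the word-level LIR term only and $\beta = 1$ on the sentence-level GIR term only, which is the kind of trade-off examined in Section 5.2.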
We have now introduced the main process of our HIR model. As shown in Algorithm 1, we first employ a Bi-LSTM to process the pre-trained word sequence and build the semantic relations between words (steps 1 to 2). Then, with the help of a syntactic parsing tool, we obtain the parent-child sets. Following a bottom-up traversal, we model the local interaction representation between each parent word and its child words in steps 4 to 11, and in steps 12 to 19 we compute the global interaction representation between all words. Finally, we optimize the loss function to jointly train LIR and GIR, and update the pre-trained embeddings with the interaction representations.
3.2 Hybrid language model pretrain-finetuning
Unsupervised representation learning, as a fundamental tool, has been shown effective in many language processing tasks (McCann2017L; Peters2018D; Howard2018U; Devlin2019B; Yang2019X). Here, we propose the hybrid language model pretrain-finetuning (HLMPf) method, which integrates the respective advantages in the PIF pipeline shown in Figure 1. The details of HLMPf are shown in Algorithm 2.
We first follow the BERT (Devlin2019B) model to train the language model pre-training layer. From step 2 to 3, we employ the fine-tuning strategy to fine-tune the interaction representation layer and the language model pre-training layer. After that, we follow the ELMo approach (Peters2018D) to obtain the context-aware word embeddings. From step 5 to 6, we show how to further fine-tune all neural layers following the feature-based strategy.
Specifically, since fine-tuning all layers at once may result in catastrophic forgetting, we adopt the gradual unfreezing strategy (Howard2018U) to fine-tune all neural layers.
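Gradual unfreezing, as described for ULMFiT, starts by training only the last layer and unfreezes one further layer per epoch, working from the output towards the input. The sketch below only computes which layers are trainable in each epoch. The layer names and the helper `gradual_unfreezing_schedule` are hypothetical stand-ins for the three layers of the PIF pipeline and are not taken from the paper's implementation.

```python
def gradual_unfreezing_schedule(layer_names, num_epochs):
    """Return, for each epoch, the list of layers that are trainable.

    layer_names is ordered from input to output. The last layer is unfrozen
    first, and one more layer (moving towards the input) is unfrozen per epoch.
    """
    schedule = []
    for epoch in range(num_epochs):
        n_unfrozen = min(epoch + 1, len(layer_names))
        schedule.append(layer_names[-n_unfrozen:])
    return schedule


# Hypothetical layer names, ordered from input to output.
layers = ["language_model", "interaction", "classifier"]
schedule = gradual_unfreezing_schedule(layers, num_epochs=4)
for epoch, trainable in enumerate(schedule, start=1):
    print(f"epoch {epoch}: train {trainable}")
```

Once every layer has been unfrozen, the remaining epochs fine-tune the whole stack.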
We start by providing an overview of the text representation models to be discussed in this paper and list the research questions that guide our experiments. Then we describe the task and datasets that we evaluate our proposals on. We conclude the section by specifying the settings of the parameters in our experiments.
Table 1: Models discussed in this paper.

| Model | Description | Source |
|---|---|---|
| LSTM | A long short-term memory network (LSTM) based representation model. | (Lai2015Recurrent) |
| C-CNN | A CNN based representation model at the character level. | (Zhang2015Character) |
| CoVe | A text representation model transferred from the machine translation model. | (McCann2017L) |
| ULMFiT | A text representation model based on general-domain language model pre-training, target task language model fine-tuning and classifier fine-tuning. | (Howard2018U) |
| LIR | A text representation model based on the local interaction representation. | This paper |
| GIR | A text representation model based on the global interaction representation. | This paper |
| HIR | A text representation model based on the hybrid interaction representation. | This paper |
| LIR$_{\mathrm{BERT}}$ | A text representation model based on the local interaction representation model in the BERT fine-tuning architecture. | This paper |
| GIR$_{\mathrm{BERT}}$ | A text representation model based on the global interaction representation model in the BERT fine-tuning architecture. | This paper |
| HIR$_{\mathrm{BERT}}$ | A text representation model based on the hybrid interaction representation model in the BERT fine-tuning architecture. | This paper |
| LIR$_{\mathrm{PIF}}$ | A text representation model based on the local interaction representation model in the Pre-train, Interact, Fine-tune architecture. | This paper |
| GIR$_{\mathrm{PIF}}$ | A text representation model based on the global interaction representation model in the Pre-train, Interact, Fine-tune architecture. | This paper |
| HIR$_{\mathrm{PIF}}$ | A text representation model based on the hybrid interaction representation model in the Pre-train, Interact, Fine-tune architecture. | This paper |
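Table 2: Dataset statistics.

| | IMDb | Yelp | TREC | AG | DBpedia |
|---|---|---|---|---|---|
| # training documents | 25K | 560K | 5K | 120K | 560K |
| # test documents | 2K | 50K | 0.5K | 7.6K | 70K |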
4.1 Model summary and research questions
Table 1 lists the models to be discussed. Among these models, LSTM, C-CNN, LIR, GIR and HIR are neural-based representation models and do not go through the pretrain-finetuning process.
Four state-of-the-art baselines: two neural-based representation models (i.e., LSTM (Liu2016Recurrent) and C-CNN (Zhang2015Character)) and two language model pre-training based representation models (i.e., CoVe (McCann2017L) and ULMFiT (Howard2018U)).
Nine flavors of approaches that we introduce in this paper: three basic interaction representation models (i.e., LIR, GIR and HIR), three interaction representation models in the BERT architecture (i.e., LIR$_{\mathrm{BERT}}$, GIR$_{\mathrm{BERT}}$ and HIR$_{\mathrm{BERT}}$), and three in the Pre-train, Interact, Fine-tune (PIF) architecture (i.e., LIR$_{\mathrm{PIF}}$, GIR$_{\mathrm{PIF}}$ and HIR$_{\mathrm{PIF}}$).
To assess the quality of our proposed interaction representation models and the PIF architecture, we consider a text classification task and seek to answer the following questions:
Does the interaction representation incorporated in the text representation model help to improve the performance for text classification?
Compared with the existing pretrain-finetuning approaches, does our proposed PIF architecture help to improve the model performance for text classification?
How does the trade-off parameter between LIR and GIR (as encoded in $\beta$) impact the performance of HIR-related models in terms of classification accuracy?
Is the performance of our proposal sensitive to the length of text to be classified?
We evaluate our proposal on five publicly available datasets used in different application domains, e.g., sentiment analysis, question classification and topic classification, which are widely used by state-of-the-art models for text classification, e.g., CoVe (McCann2017L) and ULMFiT (Howard2018U). Table 2 details the statistics of the datasets. We use accuracy as the evaluation metric to compare the performance of the discussed models.
Sentiment analysis mainly concentrates on movie review and shopping review datasets. For example, the IMDb dataset proposed by (Maas2011L) is a movie review dataset with binary sentiment labels, while the Yelp dataset compiled by (Zhang2015Character) is a shopping review dataset that has two versions, i.e., a binary and a five-class version. We concentrate on the five-class version (Johnson2017D).
For question classification, Voorhees1999T collected open-domain fact-based questions and divided them into broad semantic categories, which has six-class and fifty-class versions. We mainly focus on the small six-class version and hold out 452 examples for validation and leave 5,000 for training, which is similar to (McCann2017L).
For topic classification, we evaluate our proposals on the task of news article and ontology classification. We use the AG news corpus collected by Zhang2015Character, which has four classes of news with only the titles and description fields. In addition, the DBpedia dataset, collected by Zhang2015Character, is used, which contains the title and abstract of each Wikipedia article with 14 non-overlapping ontology classes. In general, the dataset division is the same as in (Zhang2015Character).
4.3 Model configuration and training
For data preprocessing, we split the text into sentences and tokenize each sentence using Stanford's CoreNLP (Manning2014T). In addition, we discard single-character words and punctuation and convert upper-case letters to lower-case letters. In order to fit the BERT pre-training, we add special tokens to each sentence, e.g., [CLS] and [SEP]. The remaining data preprocessing follows (Johnson2017D).
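The token-level part of this preprocessing can be sketched as follows. The sketch assumes that sentence splitting and tokenization have already been done (by CoreNLP in the paper), and it assumes the usual BERT convention of placing [CLS] at the start and [SEP] at the end of a sentence. The function name `preprocess` and the example sentence are hypothetical.

```python
import string


def preprocess(sentence_tokens):
    """Lowercase tokens, drop single-character and punctuation-only tokens,
    and wrap the sentence in BERT-style special tokens."""
    cleaned = []
    for tok in sentence_tokens:
        tok = tok.lower()
        if len(tok) == 1:
            continue  # single-character words and single punctuation marks
        if all(ch in string.punctuation for ch in tok):
            continue  # multi-character punctuation such as "..."
        cleaned.append(tok)
    # Placement of [CLS] and [SEP] follows the common BERT convention (assumed).
    return ["[CLS]"] + cleaned + ["[SEP]"]


# Hypothetical, already tokenized sentence.
tokens = preprocess(["This", "is", "a", "GREAT", "movie", "!", "..."])
print(tokens)
```

Note that dropping single-character tokens also removes words such as "a" and "I", which is what the description above implies.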
For model configuration, we use the same set of hyper-parameters across all datasets to evaluate the robustness of our proposal. In the process of pre-training, we directly employ the trained BERT model (https://github.com/google-research/bert) as our language model pre-training layer for simplicity. As for the feature-based process, we follow the ELMo model (https://github.com/allenai/bilm-tf) and employ AWD-LSTM (Merity2018R) on the trained BERT layer to get the context-aware word embeddings. For the classifier fine-tuning layer, we adopt a softmax classifier and set the size of the hidden layer to 100. In addition, we set the dimensions of the word embeddings and the hidden representation in the interaction representation layer to 400 and 200, respectively. We also apply a dropout of 0.4 to layers and 0.05 to the embedding layers.
For the whole training process, we use a batch size of 64 and base learning rates of 0.004 and 0.01 for fine-tuning the interaction representation layer and the classifier fine-tuning layer, respectively. We employ batch normalization (Ioffe2015B) to accelerate the training of the neural networks. Gradient clipping is applied by rescaling gradients when their norm exceeds a threshold of 5 (Pascanu2013O). For the fine-tuning process, we adopt the gradual unfreezing strategy (Howard2018U) to fine-tune all neural layers.
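Gradient clipping by norm, as used here with a threshold of 5, rescales the gradient vector so that its norm equals the threshold whenever the norm is larger. The following is a minimal sketch on a flat list of gradient values. The function name `clip_by_norm` is hypothetical, and treating all gradients as a single flat vector is an assumption about how the norm is computed.

```python
import math


def clip_by_norm(grads, threshold=5.0):
    """Rescale grads so that their L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)  # small gradients pass through unchanged
    scale = threshold / norm
    return [g * scale for g in grads]


clipped = clip_by_norm([6.0, 8.0])    # norm 10, rescaled to norm 5
unchanged = clip_by_norm([3.0, 4.0])  # norm 5, already within the threshold
```

The direction of the gradient is preserved; only its length is reduced.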
5 Results and Discussion
In Section 5.1, we examine the performance of our proposal incorporating the interaction representation and HLMPf on five public datasets, which aims at answering RQ1 and RQ2. Then, in Section 5.2, we analyze the impact of the trade-off parameter in HIR-related models to answer RQ3. Finally, to answer RQ4, Section 5.3 investigates the impact of varying the text length on text classification.
5.1 Performance comparison
5.1.1 Performances about the interaction representation
To answer RQ1, we first compare the performance of the basic interaction representation based models (i.e., LIR, GIR and HIR) with the baselines and present the results in Table 3.
As to the baselines, we present two types of representation models, i.e., the neural-network based models (LSTM and C-CNN) and the pretrain-finetuning based models (CoVe and ULMFiT). Among the neural-network based models, C-CNN achieves a better performance than LSTM, while among the pretrain-finetuning based models, ULMFiT is clearly the better one. Interestingly, comparing these two types of models, we find that the representation models with a pretrain-finetuning process have a clear advantage in terms of reducing the error rate. Specifically, compared with C-CNN, ULMFiT reduces the error by 37.5%, 26.6%, 44.4%, 89.8% and 48.4% on the corresponding datasets (IMDb, Yelp, TREC, AG and DBpedia in order, which is the same in the following text). This may be due to the fact that pre-training on large text corpora can capture deep syntactic and semantic information, which cannot be achieved by training the neural networks on the target task alone.
Similarly, our proposals with only the interaction representation, i.e., LIR, GIR and HIR, cannot beat the state-of-the-art pretrain-finetuning based model, i.e., ULMFiT. Compared with the neural-network based baselines, however, our proposals achieve better performance in terms of error rate. In particular, HIR is the best performing model among our proposals and improves on the best neural-network based baseline, i.e., C-CNN, with 8.6%, 9.9%, 16.0%, 26.3% and 20% reductions in error rate on the respective datasets. LIR and GIR, following HIR, also outperform C-CNN on all datasets. These findings indicate that, compared with traditional neural-network based models, modeling the interaction process explicitly can better capture the semantic relations between source elements in the text and generate a more meaningful text representation. HIR in particular, by representing both the local and the global interactions between words, is more effective in improving the performance of downstream applications.
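The percentages quoted in this subsection read as relative reductions in error rate. Assuming that interpretation, the arithmetic is sketched below; the error rates in the example are hypothetical and are not values from Table 3.

```python
def relative_error_reduction(baseline_error, model_error):
    """Relative reduction in error rate, in percent of the baseline error."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical error rates (in %): the baseline errs on 10.0% of test
# documents, the new model on 8.0%.
reduction = relative_error_reduction(10.0, 8.0)
print(f"{reduction:.1f}% relative error reduction")
```

A 2-point absolute drop from 10% to 8% is therefore a 20% relative reduction, which is why relative figures can look large even when accuracies differ by only a few points.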
5.1.2 Performances about the pretrain-finetuning
In section 5.1.1, the effectiveness of the pretrain-finetuning based and the interaction-related models have been proven. However, the basic interaction representation based models cannot beat the state-of-the-art pretrain-finetuning based model, i.e., ULMFiT. Hence, we incorporate them with the popular pretrai-finetuning architecture (i.e., BERT) and our PIF architecture to get the corresponding models (i.e., LIR, GIR, HIR and LIR, GIR, HIR), respectively. To answer RQ2, we compare the performance of these proposed models with ULMFiT and present their experimental results in Table 4.
As shown in Table 4, our basic interaction-related models combined with a pretrain-finetuning process generally outperform the state-of-the-art model, ULMFiT, with a few exceptions: the BERT-based LIR on DBpedia, the BERT-based GIR on AG and DBpedia, and the PIF-based GIR on DBpedia. These findings again show that our basic interaction representation models are promising under a pretrain-finetuning architecture. Under the BERT architecture, our interaction-related models show an accuracy distribution similar to that of the basic interaction representation models in Table 3: HIR is the best performer, followed by LIR and GIR. Specifically, the BERT-based HIR improves on ULMFiT by 7.9%, 5.6%, 6.4%, 2.6% and 2.5% on the respective datasets. The BERT-based LIR gains smaller improvements of 0.4%, 5.2%, 1.7% and 1.6% over ULMFiT on all datasets except DBpedia. The BERT-based GIR, slightly worse than LIR, beats ULMFiT on 3 out of 5 datasets.
Similar findings hold for the PIF architecture. In particular, the PIF-based HIR achieves the best performance not only within the PIF architecture but among all models discussed. Compared with the baseline ULMFiT, it gains substantial improvements of 12.1%, 9.7%, 7.5%, 3.2% and 3.8% in terms of error rate on the respective datasets. The PIF-based LIR also outperforms ULMFiT, with improvements of 7.6%, 8.8%, 5.6% and 1.8% on IMDb, Yelp, TREC and AG, and equal performance on DBpedia. The PIF-based GIR beats ULMFiT on 4 out of 5 datasets.
Furthermore, comparing the same type of interaction model under different architectures (e.g., for LIR: the basic, BERT-based and PIF-based variants), we find that the ranking by performance is the same on every dataset for LIR, GIR and HIR alike: the PIF-based variant performs best, followed by the BERT-based variant and then the basic model. This ranking demonstrates that our proposed PIF architecture, which combines the feature-based and fine-tuning based strategies, is the most effective architecture, followed by the fine-tuning based strategy, BERT, while the neural-network based models perform worse than both.
5.2 Parameter analysis
Next we turn to RQ3 and conduct a parameter sensitivity analysis of our HIR-related models, i.e., the basic HIR model and its BERT-based and PIF-based variants. As shown in Table 3 and Table 4, the error rate of the same model can differ by an order of magnitude across datasets, e.g., HIR on IMDb and Yelp. To better compare the behavior of a model across datasets, we introduce an evaluation metric, Relative Error Rate (RER), defined for a given dataset as the relative improvement ratio of the lowest error rate with regard to the error rates obtained with other values of the trade-off parameter. We examine the performance of the three models in terms of RER by varying the trade-off parameter from 0 to 1 in steps of 0.1. We plot the RER results of the basic, BERT-based and PIF-based HIR models in Figure 2(a), Figure 2(b) and Figure 2(c), respectively.
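The RER definition above can be read in more than one way. The following sketch assumes one plausible reading: for each value of the trade-off parameter, RER is the gap between that setting's error rate and the lowest error rate in the sweep, divided by that setting's error rate, so that the best setting has an RER of 0. The function name, this exact formula and the error rates in the example are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict


def relative_error_rate(errors_by_param: Dict[float, float]) -> Dict[float, float]:
    """Map each trade-off value to (error - lowest_error) / error.

    Assumed reading of RER: the relative improvement of the lowest
    error rate over the error rate obtained at each parameter value.
    """
    if not errors_by_param:
        raise ValueError("errors_by_param must not be empty")
    if any(e <= 0 for e in errors_by_param.values()):
        raise ValueError("error rates must be positive")
    lowest = min(errors_by_param.values())
    return {p: (e - lowest) / e for p, e in errors_by_param.items()}


# Hypothetical error rates (%) for one model on one dataset,
# sweeping the trade-off parameter from 0 to 1 in steps of 0.1.
hypothetical_errors = [6.0, 5.6, 5.2, 5.0, 4.8, 4.5, 4.0, 4.2, 4.4, 4.6, 4.8]
sweep = {round(0.1 * i, 1): err for i, err in enumerate(hypothetical_errors)}

rer = relative_error_rate(sweep)
best_param = min(rer, key=rer.get)  # parameter value whose RER is 0
```

Because each curve is normalized by the error rates of its own dataset, curves from datasets with very different absolute error rates can be drawn on a common axis. The location of the minimum in this example is arbitrary.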
As shown in Figure 2(a), HIR achieves its lowest error rate at the same value of the trade-off parameter on all datasets except Yelp, where the optimum lies at a different value. On each dataset, the RER of HIR decreases consistently as the parameter increases from 0 to this optimum and rises again as the parameter increases further towards 1. Similar behavior can be observed in Figure 2(b) and Figure 2(c): for each of the BERT-based and PIF-based HIR models, the RER on each dataset first decreases steadily to its lowest point and then increases steadily.
Interestingly, comparing the slope of the curves on either side of the optimum, we find that the left side is steeper than the right side, which indicates that GIR raises the error rate more readily than LIR. Furthermore, comparing the same model across datasets, we find that the range of RER on IMDb, DBpedia and TREC is greater than that on Yelp and AG. This may be due to differences in the statistical characteristics of these datasets; further experiments are required to identify the underlying reasons.
We also examine whether the ordering among the basic, BERT-based and PIF-based HIR models remains unchanged as the trade-off parameter increases across its range. Due to space limitations, we select only the Yelp dataset for this analysis, as it has the highest error rate among the datasets. We plot the experimental results in Figure 4.
As Figure 4 shows, the PIF-based HIR has the lowest error rate, followed by the BERT-based HIR, and the basic HIR has the highest, across the whole range of the trade-off parameter. This result is consistent with the previous finding on the ordering of the three HIR models, i.e., with the effectiveness of our PIF architecture. It also indicates that the effectiveness of our PIF architecture is not sensitive to the trade-off parameter.
5.3 Impact of the text length
To answer RQ4, we manually group the texts by length, i.e., 0–100, 100–200, ..., 900–1000, and above 1000. We compare the performance of the interaction representation related models, namely LIR, GIR, HIR and the BERT-based and PIF-based HIR models, under these text length settings. We plot the experimental results in Figure 5.
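The grouping by text length can be sketched as below. This is an illustrative outline only, and it makes three assumptions that the text does not specify: length is measured in whitespace-separated tokens, each group includes its lower bound and excludes its upper bound (except that a length of exactly 1000 falls into 900-1000), and the model output is available as (text, gold label, predicted label) triples. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def length_group(length: int, width: int = 100, cap: int = 1000) -> str:
    """Return the group label for a text length, e.g. '100-200' or '>1000'."""
    if length < 0:
        raise ValueError("length must be non-negative")
    if length > cap:
        return f">{cap}"
    # Clamp so that a length equal to the cap lands in the last bounded group.
    lower = min((length // width) * width, cap - width)
    return f"{lower}-{lower + width}"


def error_rate_by_length(
    examples: Iterable[Tuple[str, str, str]],
) -> Dict[str, float]:
    """Compute the error rate (%) per length group.

    Each example is a (text, gold_label, predicted_label) triple.
    """
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for text, gold, predicted in examples:
        group = length_group(len(text.split()))
        totals[group] += 1
        errors[group] += int(gold != predicted)
    return {group: 100.0 * errors[group] / totals[group] for group in totals}
```

Running `error_rate_by_length` once per model on the same test set would give one error-rate curve per model over the length groups.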
As shown in Figure 5, the relative ordering among LIR, GIR, HIR and the BERT-based and PIF-based HIR models remains unchanged as the text length increases. This is consistent with the findings in Section 5.1.1 and indicates that the effectiveness of the interaction representation and the PIF architecture is not affected by the text length.
Interestingly, as the text length increases, the error rates of all models discussed first decrease, reaching their minimum in the 100–200 group, and then increase steadily. A possible explanation is that a longer text provides richer information, which makes it easier to identify the class label and thus lowers the error rate at first. As the text grows longer still, however, its structure and semantics become more complex and variable, and the proposed models find it harder to obtain an exact representation.
6 Conclusion and Future Work
In this paper, we focus on the task of text classification and propose a novel pipeline, the PIF architecture, which combines the respective advantages of the feature-based and fine-tuning based strategies in the language model pretrain-finetuning process. We also introduce the concept of interaction representation and propose a two-perspective interaction representation for sentence embeddings, i.e., a local interaction representation (LIR) and a global interaction representation (GIR). We combine these two representations to produce a hybrid interaction representation model, i.e., HIR.
We evaluate these models on five widely-used datasets for text classification. Our experimental results show that: (1) compared with traditional neural-network based models, our basic interaction-related models help boost the performance of text classification in terms of error rate; (2) our proposed PIF architecture is more effective at improving text classification than the existing feature-based and fine-tuning based strategies, and in particular the HIR model under the PIF architecture achieves the best performance on each dataset; (3) the effectiveness of the interaction representation and the PIF architecture is not affected by the text length.
As to future work, we plan to evaluate our models for other tasks so as to verify the robustness of the interaction representation models. In addition, the existing fine-tuning approach is too general. We want to investigate some task-sensitive fine-tuning methods to better improve the performance.
This work was partially supported by the National Natural Science Foundation of China under No. 61702526, the Defense Industrial Technology Development Program under No. JCKY2017204B064, the National Advanced Research Project under No. 6141B0801010b, Ahold Delhaize, the Association of Universities in the Netherlands (VSNU), and the Innovation Center for Artificial Intelligence (ICAI). All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.