Robust Text Classifier on Test-Time Budgets

Md Rizwan Parvez
University of California Los Angeles

Kai-Wei Chang
University of California Los Angeles

Tolga Bolukbasi
Boston University

Venkatesh Saligrama
Boston University

We design a generic framework for learning a robust text classification model that achieves high accuracy under different selection budgets (a.k.a. selection rates) at test time. We take a different approach from existing methods and learn to dynamically filter out a large fraction of unimportant words with a low-complexity selector, so that any high-complexity classifier only needs to process the small fraction of text that is relevant to the target task. To this end, we propose a data aggregation method for training the classifier, allowing it to achieve competitive performance on fractured sentences. On four benchmark text classification tasks, we demonstrate that the framework gains consistent speedup with little degradation in accuracy on various selection budgets.

1 Introduction

Recent advances in deep neural networks (DNNs) have achieved high accuracy on many text classification tasks. These approaches process the entire text and encode words and phrases in order to perform target tasks. While these models realize high accuracy, their computation time scales linearly with the size of the document, which can be slow for a long document. In this context, various approaches based on modifying the RNN or LSTM architecture have been proposed to speed up the process (skim-rnn; learning-to-skim). However, the processing in these models is still fundamentally sequential and needs to operate on the whole document, which limits the computational gain.

Figure 1: Our proposed framework. Given a selection rate, a selector is designed to select relevant words and pass them to the classifier. To make the classifier robust against fractured sentences, we aggregate outputs from different selectors and train the classifier on the aggregated corpus.

In contrast to previous approaches, we propose a novel framework for efficient text classification on long documents that mitigates sequential processing. The framework consists of a selector and a classifier. Given a selection budget as input, the selector performs a coarse one-shot selection, deleting unimportant words and passing the remainder to the classifier. The classifier then takes the sentence fragments as input and performs the target task. Figure 1 illustrates the procedure. This framework is general and agnostic to the architecture of the downstream classifier (e.g., RNN, CNN, Transformer).

However, three challenges arise. First, to build a computationally inexpensive system, the selector must have negligible overhead. We adopt two simple yet effective architectures to design selectors, based on word embeddings and bag-of-words. Second, training multiple distinct models for different budgets is infeasible in practice, especially when the model size is large. Hence, our goal is to learn a single classifier that can adapt to the output of any selector operating at any budget. Consequently, this classifier must be robust so that it can achieve consistent performance with different budgets. Third, the input to the classifier in our framework is a sequence of fractured sentences, which is incompatible with a standard classifier trained on full texts and causes its performance to degrade significantly. One potential but infeasible solution is to train the classifier with a diverse collection of sentence fragments, which is combinatorially numerous. Another approach is to randomly blank out text (a.k.a. blanking-noise), which leads to marginalized feature distortion (maaten2013learning). This also yields poor accuracy, because DNNs leverage word combinations and sentence structure, which this approach does not account for. To mitigate this problem, we propose a data aggregation framework that augments the training corpus with outputs from selectors at different budget levels. By training the classifier on the aggregated structured blank-out text, the classifier learns to fuse fragmented sentences into a feature representation that mirrors the representation obtained on full sentences and thus realizes high accuracy. We evaluate our approach through comprehensive experiments on real-world datasets. [Footnote 1: Our source code is available at:]

2 Related Work

Several approaches have been proposed to speed up DNNs at test time (Kilian; choi). LSTM-jump (learning-to-skim) learns to completely skip words deemed to be irrelevant, and skim-RNN (skim-rnn) uses a low-complexity LSTM to skim words rather than skipping them. Another version of LSTM-jump, LSTM-shuttle (speed-emnlp18), first skips a number of words, then goes backward to recover lost information by reading some of the words skipped before. All these approaches require modifying the architecture of the underlying classifier and cannot easily be extended to another architecture. In contrast, we adopt existing classifier architectures (e.g., LSTM, BCN (bcn)) and propose a meta-learning algorithm to train the model. Our framework is generic, and the classifier can be viewed as a black box. Similar to us, tao-lei propose a selector-classifier framework to find text snippets as justification for text classification, but their selector and classifier have similar complexity and require similar processing times; therefore, it is not suitable for computational gain. Various feature selection approaches (chandrashekar2014survey) have been discussed in the literature, for example, removing predefined stop-words (see Appendix A), attention-based models (bahdanau; luong), feature subspace selection methods (e.g., PCA), and applying L1 regularization (e.g., Lasso (lasso1), Group Lasso (faruqui2015sparse), or BLasso (gao)). However, these approaches either cannot obtain sparse features or cannot straightforwardly be applied to speed up a DNN classifier. Different from ours, viola2001robust; trapeznikov2013supervised; karayev2013dynamic; xu2013cost; kusner2014feature; bengio2015conditional; leroux2017cascading; zhu19; NIPS2017_7058; pmlr-v70-bolukbasi17a focus on gating various components of existing networks. Finally, aggregating data or models has been studied in different contexts (e.g., reinforcement learning (daggar) and Bagging models (breiman1996bagging)), whereas we aggregate the data output from selectors instead of models.

3 Classification on a Test-Time Budget

Our goal is to build a robust classifier along with a suite of selectors to achieve good performance with consistent speedup under different selection budgets at test time. Formally, a classifier $C(\hat{x})$ takes a word sequence $\hat{x}$ and predicts the corresponding output label $y$, and a selector $S_b(x)$ with selection budget $b$ takes an input word sequence $x = \{w_1, w_2, \ldots, w_N\}$ and generates a binary sequence $S_b(x) = \{z_{w_1}, z_{w_2}, \ldots, z_{w_N}\}$, where $z_{w_k} \in \{0, 1\}$ indicates whether the corresponding word $w_k$ is selected. We denote the sub-sequence of words generated after filtering by the selector as $I(x, S_b(x)) = \{w_k : z_{w_k} = 1, \forall w_k \in x\}$. We aim to train a classifier $C$ and the selector $S_b$ such that $I(x, S_b(x))$ is sufficient to make an accurate prediction of the output label (i.e., $C(I(x, S_b(x))) \approx C(x)$). The selection budget (a.k.a. selection rate) is controlled by the hyper-parameters of the selector. A higher budget often leads to higher accuracy and longer test time.
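
As a minimal illustration of the filtering step described above, the following Python sketch applies a binary selection sequence to a word sequence and reports the resulting selection rate. The function names `apply_selection` and `selection_rate`, as well as the toy sentence and mask, are hypothetical and chosen only for this example.

```python
from typing import List


def apply_selection(words: List[str], mask: List[int]) -> List[str]:
    """Return the sub-sequence of words whose mask entry is 1."""
    if len(words) != len(mask):
        raise ValueError("words and mask must have the same length")
    return [w for w, z in zip(words, mask) if z == 1]


def selection_rate(mask: List[int]) -> float:
    """Fraction of words kept by the selector (0.0 for an empty input)."""
    return sum(mask) / len(mask) if mask else 0.0


# Toy input: the mask is made up here, not produced by a trained selector.
words = "the movie was cheap but not healthy".split()
mask = [0, 1, 0, 1, 1, 1, 1]

fragment = apply_selection(words, mask)   # words passed on to the classifier
rate = selection_rate(mask)               # 5 of 7 words are kept
```

The classifier would then receive `fragment` instead of the full sentence, so its cost depends on the number of selected words rather than on the original length.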

3.1 Learning a Selector

We propose two simple but efficient selectors.

Word Embedding (WE) selector. We consider a parsimonious word-selector using word embeddings (e.g., GloVe (glove)) as features to predict important words. We assume the informative words can be identified independently and model the probability that a word $w_k$ is selected by $P(z_{w_k} \mid w_k) = \sigma(\theta_S^{\top} \vec{w}_k)$, where $\theta_S$ denotes the model parameters of the selector $S_b$, $\vec{w}_k$ is the corresponding word vector, and $\sigma$ is the sigmoid function. As we do not have explicit annotations about which words are important, we train the selector along with a classifier in an end-to-end manner following tao-lei, and an L1-regularizer is added to control the sparsity (i.e., selection budget) of $S_b(x)$.
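
The per-word scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the embeddings and parameter vector are toy values rather than trained ones, keeping a word when its probability reaches a threshold of 0.5 is an assumed inference rule, and dropping out-of-vocabulary words is an assumed policy. The names `we_select`, `toy_embeddings`, and `toy_theta` are hypothetical.

```python
import math
from typing import Dict, List


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def we_select(words: List[str],
              embeddings: Dict[str, List[float]],
              theta: List[float],
              threshold: float = 0.5) -> List[int]:
    """Score each word independently with sigmoid(theta . embedding)."""
    mask = []
    for w in words:
        vec = embeddings.get(w)
        if vec is None:
            # Assumed policy: words without an embedding are not selected.
            mask.append(0)
            continue
        score = sum(t * v for t, v in zip(theta, vec))
        mask.append(1 if sigmoid(score) >= threshold else 0)
    return mask


# Toy 2-dimensional embeddings and parameters (illustrative values only).
toy_embeddings = {"the": [-1.5, 0.0], "great": [2.0, 0.5], "film": [0.5, 1.0]}
toy_theta = [1.0, 0.2]

we_mask = we_select(["the", "great", "film"], toy_embeddings, toy_theta)
```

Because each word is scored independently with one dot product, the selector's cost is linear in the document length with a very small constant, which is what makes it cheap relative to the downstream classifier.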

Bag-of-Words selector. We also consider using an L1-regularized linear model (elasticnet; ng_L1; kai_L1) with bag-of-words features to identify important words. In the bag-of-words model, for each document $x$, we construct a feature vector $\vec{x} \in \{0, 1\}^{|V|}$, where $|V|$ is the size of the vocabulary. Each element $\vec{x}_w$ of the feature vector indicates whether a specific word $w$ appears in the document $x$. Given a training set $\mathcal{X}$, the linear model optimizes the L1-regularized task loss. For example, in the case of a binary classification task (output label $y \in \{1, -1\}$),

$$a^{*} = \arg\min_{a \in \mathbb{R}^{|V|}} \sum_{(x, y) \in \mathcal{X}} \log\left(1 + e^{-y\, a^{\top} \vec{x}}\right) + \frac{1}{b} \lVert a \rVert_1,$$

where $a$ is a weight vector to be learned, $a_w$ corresponds to word $w \in V$, and $b$ is a hyper-parameter controlling the sparsity of $a^{*}$ (i.e., selection budget). The lower the budget $b$ is, the sparser the selection is. Based on the optimal $a^{*}$, we construct a selector that picks word $w$ if the corresponding $a^{*}_w$ is non-zero. Formally, the bag-of-words selector outputs $S_b(x) = \{\delta(a^{*}_w \neq 0) : w \in x\}$, where $\delta$ is an indicator function.
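
Once the sparse weight vector has been learned, applying the bag-of-words selector reduces to a dictionary lookup per word. The Python sketch below assumes the weights are available as a word-to-weight mapping; the weights shown are made-up values, not the output of an actual L1-regularized fit, and the name `bow_select` is hypothetical.

```python
from typing import Dict, List


def bow_select(words: List[str], weights: Dict[str, float]) -> List[int]:
    """Select a word if and only if its learned weight is non-zero.

    Words missing from the mapping are treated as having weight zero.
    """
    return [1 if weights.get(w, 0.0) != 0.0 else 0 for w in words]


# Illustrative sparse weights: most of the vocabulary has weight exactly 0.
toy_weights = {"not": -0.8, "bad": -1.2, "plot": 0.0}

bow_words = "the plot was not bad".split()
bow_mask = bow_select(bow_words, toy_weights)
bow_fragment = [w for w, z in zip(bow_words, bow_mask) if z]
```

A stronger L1 penalty drives more weights to exactly zero, so fewer words pass the non-zero test and the selection rate drops.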

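Dataset  Model                          acc.  selection(%)  time  speedup
SST-2    Baseline                       85.7  100           9     1x
SST-2    Bag-of-Words                   78.8  75            5.34  1.7x
SST-2    Our framework (best speedup)   82.6  65            4.6   2x
SST-2    Our framework (best accuracy)  85.3  0             9     1x
IMDB     Baseline                       91.0  100           1546  1x
IMDB     Bag-of-Words                   91.5  91            1258  1.2x
IMDB     Our framework (best speedup)   92.0  91            1297  1.2x
IMDB     Our framework (best accuracy)  92.1  0             1618  1x
AGNews   Baseline                       92.3  100           59    1x
AGNews   Bag-of-Words                   92.9  97            48    1.2x
AGNews   Our framework (best speedup)   93.1  91            46    1.3x
AGNews   Our framework (best accuracy)  93.2  0             57    1x
Yelp     Baseline                       66.5  100           3487  1x
Yelp     Bag-of-Words                   59.7  55            2325  1.6x
Yelp     Our framework (best speedup)   64.8  55            2179  1.6x
Yelp     Our framework (best accuracy)  66.3  0             3448  1x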
Table 1: Accuracy and speedup on the test datasets. Test-times are measured in seconds. The speedup rate is calculated as the running time of the corresponding baseline divided by the running time of the model. For our framework, the top row denotes the best speedup and the bottom row denotes the best test accuracy achieved. Overall best accuracies and best speedups are boldfaced. Our framework achieves accuracies better than the baseline with speedups of 1.2x and 1.3x on IMDB and AGNews, respectively. With the same or higher speedup, our accuracies are much better than those of Bag-of-Words.
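
As a worked check of the speedup column, the sketch below recomputes the speedup of the best-speedup rows of our framework from the test-times listed in Table 1, taking speedup as baseline time divided by model time. The helper name `speedup` and the dictionary layout are illustrative only.

```python
def speedup(baseline_seconds: float, model_seconds: float) -> float:
    """Speedup of a model relative to its baseline (larger is faster)."""
    return baseline_seconds / model_seconds


# (baseline time, best-speedup time of our framework), in seconds, from Table 1.
table1_times = {
    "SST-2": (9, 4.6),
    "IMDB": (1546, 1297),
    "AGNews": (59, 46),
    "Yelp": (3487, 2179),
}

speedups = {name: round(speedup(base, ours), 1)
            for name, (base, ours) in table1_times.items()}
# speedups == {"SST-2": 2.0, "IMDB": 1.2, "AGNews": 1.3, "Yelp": 1.6}
```

The rounded values agree with the reported 2x, 1.2x, 1.3x, and 1.6x.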

3.2 The Data Aggregation Framework

In order to learn to fuse fragmented sentences into a robust feature representation, we propose to train the classifier on the aggregated corpus of structured blank-out texts.

Given a set of training data $\mathcal{X} = \{(x_1, y_1), \ldots, (x_t, y_t)\}$, we assume we have a set of selectors $\mathcal{S} = \{S_b\}$ with different budget levels trained by the framework discussed in Section 3.1. To generate an aggregated corpus, we first apply each selector $S_b \in \mathcal{S}$ on the training set and generate the corresponding blank-out corpus $I(\mathcal{X}, S_b)$. Then, we create a new corpus by aggregating the blank-out corpora: $\mathcal{T} = \mathcal{X} \cup \bigcup_{S_b \in \mathcal{S}} I(\mathcal{X}, S_b)$. (Footnote 2: Note that the union operation is used just to aggregate the training instances, which does not hinder the model training (e.g., discrete variables).) Finally, we train the classifier $C_{\mathcal{T}}$ on the aggregated corpus $\mathcal{T}$. As $C_{\mathcal{T}}$ is trained on documents with distortions, it learns to make predictions with different budget levels. The training procedure is summarized in Algorithm 1. In the following, we discuss two extensions of our data aggregation framework.

First, the blank-out data can be generated from different classes of selectors with different features or architectures. Second, the blank-out and selection can be done at the phrase or sentence level. Specifically, if phrase boundaries are provided, phrase-level aggregation can prevent a selector from breaking compound nouns or meaningful phrases (e.g., “Los Angeles”, “not bad”). Similarly, for multi-sentence documents, we can force the selector to pick a whole sentence if any word in the sentence is selected.
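
The phrase-level and sentence-level extensions can both be viewed as expanding a word-level mask to whole spans. The Python sketch below is one possible reading: a span (a phrase or a sentence, given as a half-open index range) is kept entirely if any word inside it is selected. Applying this "any word" rule to phrases as well as sentences is an assumption, and the name `expand_mask_to_spans` and the example spans are hypothetical.

```python
from typing import List, Tuple


def expand_mask_to_spans(mask: List[int],
                         spans: List[Tuple[int, int]]) -> List[int]:
    """Select every word of a span if at least one of its words is selected.

    Spans are (start, end) index pairs with `end` exclusive. Words outside
    all spans keep their original mask value.
    """
    expanded = list(mask)
    for start, end in spans:
        if any(mask[start:end]):
            for i in range(start, end):
                expanded[i] = 1
    return expanded


span_words = ["flights", "to", "Los", "Angeles", "are", "not", "bad"]
word_mask = [1, 0, 0, 1, 0, 0, 1]
# Assumed phrase boundaries: "Los Angeles" and "not bad".
phrase_spans = [(2, 4), (5, 7)]

phrase_mask = expand_mask_to_spans(word_mask, phrase_spans)
```

With these inputs the word-level mask would have split both phrases, while the expanded mask keeps "Los Angeles" and "not bad" intact at the cost of a slightly higher selection rate.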

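Input: Training corpus $\mathcal{X}$, a set of selectors with different budget levels $\mathcal{S} = \{S_b\}$, classifier class $C$
Output: A robust classifier $C_{\mathcal{T}}$
Initialize the aggregated corpus: $\mathcal{T} \leftarrow \mathcal{X}$
for $S_b \in \mathcal{S}$ do
      Train a selector $S_b$ with budget level $b$ on $\mathcal{X}$
      Generate a blank-out dataset $I(\mathcal{X}, S_b)$
      Aggregate data: $\mathcal{T} \leftarrow \mathcal{T} \cup I(\mathcal{X}, S_b)$
Train a classifier $C_{\mathcal{T}}$ on $\mathcal{T}$
return $C_{\mathcal{T}}$
Algorithm 1 Data Aggregated Training Schema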
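
A minimal Python sketch of the aggregation loop is given below, treating each already-trained selector as a black-box function from a word list to a binary mask. Training the selectors and the classifier is omitted. Whether the unfiltered corpus is included is exposed as the `include_original` option, which is an assumption of this sketch, and the toy selector `drop_short_words` is a stand-in rather than one of the selectors proposed here.

```python
from typing import Callable, List, Tuple

Example = Tuple[List[str], int]               # (words, label)
Selector = Callable[[List[str]], List[int]]   # words -> binary mask


def blank_out(corpus: List[Example], selector: Selector) -> List[Example]:
    """Apply one selector to every document, keeping the labels unchanged."""
    filtered = []
    for words, label in corpus:
        mask = selector(words)
        filtered.append(([w for w, z in zip(words, mask) if z], label))
    return filtered


def aggregate(corpus: List[Example],
              selectors: List[Selector],
              include_original: bool = True) -> List[Example]:
    """Concatenate the blank-out corpora produced by all selectors."""
    aggregated = list(corpus) if include_original else []
    for selector in selectors:
        aggregated.extend(blank_out(corpus, selector))
    return aggregated


def drop_short_words(words: List[str]) -> List[int]:
    """Toy stand-in selector: keep words longer than two characters."""
    return [1 if len(w) > 2 else 0 for w in words]


toy_corpus = [("a really good film".split(), 1), ("not bad".split(), 1)]
toy_aggregated = aggregate(toy_corpus, [drop_short_words])
# A classifier of any architecture would then be trained on toy_aggregated.
```

Because the classifier only ever sees lists of words and labels, nothing in this loop depends on its architecture.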

Dataset  #class  Vocabulary  Size (Train/Valid/Test)  Avg. Len
SST-2    2       13,750      6,920/872/1,821          19
IMDB     2       61,046      21,143/3,857/25,000      240
AGNews   4       60,088      101,851/18,149/7,600     43
Yelp     5       1,001,485   600k/50k/50k             149
Table 2: Dataset statistics.

Figure 2: Performance under different test-times on (a) IMDB, (b) AGNews, and (c) SST-2. All the approaches use the same LSTM model as the back-end. The Bag-of-Words model and our framework use the same bag-of-words selector cascaded with this LSTM classifier, trained on the original training corpus and on the aggregated corpus, respectively. Our model (blue dashed line) significantly outperforms the others for any test-time budget. Its performance is also robust, while the results of skim-RNN are inconsistent across budget levels.

World News .. plant searched. Kansai Electric Power’s nuclear power plant in Fukui .. was searched by police Saturday ..
Business Telecom Austria taps the Bulgarian market. Telecom Austria, Austrias largest telecoms operator, obtained ..
Sci/Tech .. Reuters - Software security companies and handset makers, including Finland’s Nokia (NOK1V.HE), are ..
Table 3: Examples of the WE selector output on AGNews. Bold words are selected.

4 Experiments

To evaluate the proposed approach, we consider four benchmark datasets: SST-2 (sst), IMDB (maas), AGNews (agnews), and Yelp (verydeepconv), and two widely used architectures for classification: LSTM and BCN (bcn). The statistics of the datasets are summarized in Table 2. We evaluate the computational gain of models in terms of overall test-time, and the performance in terms of accuracy. We follow skim-rnn to estimate the test-time of models on CPU (the machine specification is in Appendix C) and exclude the time for data loading.

In our approach, we train a classifier with both WE and bag-of-words selectors with 6 selection budgets {50%, 60%, …, 100%} using the word-level data aggregation framework (for the very large Yelp dataset, 3 selection budgets {50%, 60%, 70%} are used). We evaluate the computational gain of the proposed method through a comparative study of its performance under different test-times by varying the selection budgets (Appendix B discusses how to vary these budgets), in comparison to the following approaches: (1) Baseline: the original classifier (i.e., no selector, no data aggregation). (2) skim-RNN: we train a skim-RNN model and vary the amount of text to skim (i.e., test-time) by tuning its threshold parameter as in skim-rnn. (3) Bag-of-Words: filtering words with the bag-of-words selector and feeding the sentence fragments to the original classifier (i.e., no data aggregation). This approach serves as a good baseline and has been considered in the context of linear models (e.g., chang2008feature). For a fair comparison, we implement all approaches on the same framework using the AllenNLP library, including a re-implementation of the existing state-of-the-art speedup framework skim-RNN (skim-rnn), whose official implementation is not released. As skim-RNN is designed specifically for accelerating the LSTM model, we only compare with skim-RNN using the LSTM classifier. Each model is selected by tuning parameters on validation data. The model is then frozen and evaluated on test data for different selection budgets.

Figure 2 demonstrates the trade-off between performance and test-time for each setting. Overall, we expect the error to decrease with a larger test-time budget. From Figure 2, on all of the IMDB, AGNews, and SST-2 datasets, the LSTM classifier trained with our proposed data aggregation not only achieves the lowest error curve, but its results are also robust and consistent. That is, our approach achieves higher performance across different test-time budgets, and its performance is a predictable monotonic function of the test-time budget. In contrast, the performance of skim-RNN is inconsistent across budgets. In fact, neither skim-RNN nor LSTM-jump addresses the problem of the different word distributions between training and testing under multiple budgets. Therefore, we anticipate that the behavior of LSTM-jump will be inconsistent as well, similar to skim-RNN. (Footnote 8: As an example, from Table 6 in learning-to-skim, the performance of LSTM-jump drops from 0.881 to 0.854 although it takes a longer test-time (102s) than the baseline (81.7s).) Additionally, since LSTM-jump has already been shown to be outperformed by skim-RNN, we do not compare with it further. Next, we show that our framework is generic and can incorporate other classifiers, such as BCN (see Table 1). (Footnote 9: Because of the inherent accuracy/inference-time trade-off, it is difficult to depict model comparisons. For this reason, in Figure 2, we plot the trade-off curve to demonstrate the best speedup achieved by our model while attaining near state-of-the-art performance. On the other hand, test results are tabulated in Table 1 to focus attention primarily on accuracy.) When phrase boundary information is available, our model can further achieve an accuracy of 86.7 with 1.7x speedup for BCN on the SST-2 dataset by using phrase-level data aggregation. Finally, one more advantage of the proposed framework is that the output of the selector is interpretable. In Table 3, we show that our framework correctly selects words such as “Nokia” and “telecom” and phrases such as “searched by police” and “software security”, and filters out words like “Aug.”, “users”, and “products”.

Note that although we focus on efficient inference, empirically our method is no more complex than the baseline during training. Although the number of training instances increases, and with it the training time for each epoch, the number of epochs required to obtain a good model is usually smaller. For example, on the Yelp corpus, we only need 3 epochs to train a BCN classifier on the aggregated corpus generated by using 3 different selectors, while training on the original corpus requires 10 epochs.
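
The Yelp example can be turned into a rough back-of-the-envelope estimate, shown below in Python. This is not a measurement: it assumes that training cost is proportional to the number of tokens processed, that each selector keeps exactly its budget (50%, 60%, and 70%, the Yelp budgets used in our experiments), and that the unfiltered corpus is part of the aggregated corpus. The function name `relative_token_cost` is hypothetical.

```python
from typing import List


def relative_token_cost(budgets_pct: List[int],
                        epochs: int,
                        include_original: bool = True) -> float:
    """Tokens processed in training, in units of one pass over the full corpus."""
    per_epoch_pct = sum(budgets_pct) + (100 if include_original else 0)
    return per_epoch_pct * epochs / 100


# Baseline: the original corpus only, trained for 10 epochs.
baseline_cost = relative_token_cost([], epochs=10)

# Aggregated corpus: original corpus plus three blank-out copies, 3 epochs.
aggregated_cost = relative_token_cost([50, 60, 70], epochs=3)
```

Under these assumptions the aggregated run processes about 8.4 corpus-equivalents of tokens against 10 for the baseline, which is in line with the observation that training is not more expensive overall.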

5 Conclusion

We present a framework to learn a robust classifier under test-time constraints. We demonstrate that the proposed selectors effectively select important words for the classifier to process and that the data aggregation strategy improves model performance. As future work, we will apply the framework to other text reading tasks. Another promising direction is to explore the benefits of our text classification model in an edge-device setting. This problem naturally arises with local devices (e.g., smart watches or mobile phones) that do not have sufficient memory or computational power to execute a complex classifier, so instances must be sent to the cloud. Our framework is particularly suited to this setting, since we could choose to send only the important words to the cloud. In contrast, skim-RNN and LSTM-jump, which process the text sequentially, have to either send the entire text to the server or require multiple rounds of communication between the server and local devices, resulting in high network latency.

6 Acknowledgments

We thank the anonymous reviewers for their insightful feedback. We also thank the UCLA-NLP group for discussion and comments. This work was supported in part by National Science Foundation grants IIS-1760523 and CCF-1527618.


Appendix A Stop-word Removal

Our preliminary experiments show that although stop-word removal achieves a notable speedup, it sometimes comes with a significant performance drop. For example, after removing stop-words from the SST-2 dataset, testing is about 2x faster but the accuracy drops from 85.5 to 82.2. This is because the stop-words used for filtering text are not learned with the class labels; therefore, some meaningful words (e.g., “but”, “not”) are filtered out even if they play a very significant role in determining the polarity of the full sentence (e.g., “cheap but not healthy”). Besides, we cannot control the budget in the stop-word approach.
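
The failure mode can be reproduced with a few lines of Python. The stop-word list below is a tiny hypothetical one used only for illustration; it is not the list used in our experiments.

```python
from typing import List

# Hypothetical miniature stop-word list, fixed in advance and label-agnostic.
STOP_WORDS = {"a", "the", "is", "but", "not"}


def remove_stop_words(words: List[str]) -> List[str]:
    """Drop every word that appears in the predefined stop-word list."""
    return [w for w in words if w.lower() not in STOP_WORDS]


stop_filtered = remove_stop_words("cheap but not healthy".split())
# stop_filtered == ["cheap", "healthy"]
```

The remaining fragment "cheap healthy" no longer carries the negation, so a sentiment classifier can easily flip its prediction. The list also offers no knob for hitting a target selection rate.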

Appendix B Hyperparameter Tuning

Because performance generally increases with the amount of text selected, controlling the selection budget in effect controls the performance. In this section, we discuss how to control the selection budget by tuning the hyperparameters.

B.1 Tuning the WE selector

For the WE selector, we vary the selection budget by tuning the two hyperparameters of tao-lei, sparsity ($\lambda_1$) and coherence, i.e., continuity ($\lambda_2$). The table below gives example settings and the corresponding fraction of text selected.

Sparsity ($\lambda_1$)  Continuity ($\lambda_2$)  Selection rate (%)
8.5e-05                 2.0                       2.0
8.5e-05                 1.0                       3.0
9.5e-05                 2.0                       5.0
9.5e-05                 1.0                       6.0
0.0001                  2.0                       9.0
0.0001                  1.0                       12.0
0.000105                2.0                       13.0
0.000105                1.0                       15.0
0.00011                 2.0                       16.0
0.00011                 1.0                       22.0
0.000115                2.0                       23.0
0.000115                1.0                       24.0
0.00012                 2.0                       28.0
0.00012                 1.0                       64.0

B.2 Tuning the Bag-of-Words selector

As an example, the following table lists values of the regularization hyper-parameter C and the corresponding selection rate of the bag-of-words selector on IMDB.

C       Selection rate (%)
0.01    27
0.05    37
0.1     53
0.11    63
0.15    66
0.25    73
0.7785  79
1.5     82
2.5     88
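
In practice, a table like the one above can be used as a lookup to pick the setting closest to a desired budget. The Python sketch below encodes the IMDB values listed above and returns the value of C whose measured selection rate is nearest to a target. The helper `pick_c` is hypothetical, and resolving ties in favor of the smaller C is an arbitrary choice.

```python
from typing import List, Tuple

# (C, selection rate in %) pairs for the bag-of-words selector on IMDB.
IMDB_C_TO_RATE: List[Tuple[float, int]] = [
    (0.01, 27), (0.05, 37), (0.1, 53), (0.11, 63), (0.15, 66),
    (0.25, 73), (0.7785, 79), (1.5, 82), (2.5, 88),
]


def pick_c(target_rate_pct: float,
           table: List[Tuple[float, int]] = IMDB_C_TO_RATE) -> float:
    """Return the C whose selection rate is closest to the target.

    `min` keeps the first of several equally close rows, i.e. the smaller C.
    """
    return min(table, key=lambda row: abs(row[1] - target_rate_pct))[0]


c_for_50 = pick_c(50)   # closest listed rate is 53%
c_for_65 = pick_c(65)   # closest listed rate is 66%
```

Rates between the listed points would require fitting the selector with intermediate values of C and measuring the resulting selection rate again.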

B.3 Tuning skim-RNN

We re-implement the skim-RNN model on the same baseline as ours, with a large RNN and several small RNN sizes. For the results in Table 2 (in main paper), we compare our model with the best results found among the skim-RNN models with different large and small RNN sizes. For IMDB, we take the configuration that gives the best speedup and accuracy; for Figure 2 (in main paper), we use this model and vary the selection threshold at inference time, as described in skim-rnn, to obtain different selections of words. We report the accuracy and the test-time for each setting and plot them in Figure 2 (in main paper). The following are the selection thresholds for IMDB.

Threshold  Selection rate (%)
0.45       99
0.48       97
0.47       93
0.505      63
0.51       54
0.52       34
0.53       20

Appendix C Machine Specification

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Stepping:              2
CPU MHz:               1200.890
BogoMIPS:              6596.22
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-11