# Train Once, Test Anywhere: Zero-Shot Learning for Text Classification

## Abstract

Zero-shot learners are models capable of predicting unseen classes. In this work, we propose a zero-shot learning approach for text categorization. Our method involves training a model on a large corpus of sentences to learn the relationship between a sentence and the embeddings of the sentence’s tags. Learning this relationship lets the model generalize to unseen sentences, tags, and even new datasets, provided they can be put into the same embedding space. The model learns to predict whether a given sentence is related to a tag or not, unlike other classifiers that learn to classify the sentence as one of a fixed set of possible classes. We propose three different neural networks for the task and report their accuracy on the test set of the dataset used for training them, as well as on two other standard datasets for which no retraining was done. We show that our models generalize well to new unseen classes in both cases. Although the models do not achieve the accuracy of state-of-the-art supervised models, the approach is evidently a step towards general intelligence in natural language processing.

## 1 Introduction

Zero-shot learning has been an area of special interest in recent years. It not only allows algorithms to scale to unseen classes but can also be used across datasets, as we try to show in this work. We report a methodology that can be used for zero-shot learning in text categorization. To achieve this, we model text categorization as a binary classification problem of finding relatedness between sentences and categories. Models trained in this fashion learn the relatedness (yes/no) of a given sentence to each category separately, instead of predicting the class directly as in multiclass-multilabel classification (Figure ?). For instance, the sentence “Obama said that GOP’s efforts to repeal health care were aggravating” is labeled as belonging to the categories ‘politics’ and ‘healthcare’ but not to the categories ‘sports’ and ‘technology’. Table 1 lists other examples of sentences belonging to multiple and overlapping text categories. The proposed redefinition allows a neural network model trained on one dataset to be deployed for text categorization on other datasets without retraining.

We further propose neural network architectures that can be used as zero-shot learners with this technique. One model is a single-layer neural network on the mean of the word embeddings; the other two use LSTMs to model the sentence as a sequence.

We believe that training networks on a large amount of noisily annotated data leads to more general models than training on smaller datasets annotated specifically for the underlying task. Moreover, using noisily annotated data from the open web saves annotation cost. Therefore, we trained our model on news headlines crawled from around the web, with their Search Engine Optimization (SEO) tags as categories (we call this the source dataset), and tested its performance. We further tested our model on the News Aggregator [5] and tweet classification [7] datasets, showing that the concept of relatedness it learns is useful across datasets. Briefly, the contributions of this paper are three-fold:

1. We propose a framework that casts zero-shot text categorization as a binary classification task of finding relatedness between text and categories. We show that this framework can adapt to any number of text categories as well as across datasets, without re-training or fine-tuning the model.

2. We propose three neural network architectures that can use the technique above for zero-shot classification.

3. We report the zero-shot accuracy of our models, trained on the source dataset, on different datasets and compare it with state-of-the-art results obtained by models that were specifically trained on those datasets. We show that our architectures can generalize to classes they have not seen and even to datasets they have not been trained on.

## 2 Related Work

Many zero-shot learning approaches have been proposed in the domain of computer vision [8], [9]. However, there is very little work on zero-shot learning in the domain of natural language processing. To the best of our understanding, this is the first work to report a zero-shot learning solution for text categorization. Our first architecture is a single-layer neural network on the concatenation of (1) the mean embedding of the sentence and (2) the embedding of the tag. It is inspired by shallow architectures that achieve good scores on text classification tasks, such as [3]. The second architecture, instead of taking a mean of the embeddings before passing it to the classification layer, models the sequence using an LSTM [2]. Our third architecture, also an LSTM, may be considered similar to the architecture used by Wang et al. [10] for aspect-based sentiment analysis. Instead of the “Aspect Term”, we pass the embedding of the tag to be judged related or not related. However, we do not employ the attention component described in the work by Wang et al.

## 3 Data

### 3.1 Source Dataset

We crawled more than 4,200,000 news headlines from around the web along with their SEO tags. The corpus had more than 300,000 unique SEO tags. For simplicity, we henceforth refer to news headlines and SEO tags as sentences and tags, respectively. A news article can have more than one SEO tag. In such cases, we added multiple instances of the sentence to our data, one for each tag.
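The expansion of multi-tag headlines into one instance per tag can be sketched as follows. This is a minimal illustration rather than the crawling pipeline itself: the function name `expand_by_tag` and the in-memory list of `(headline, tags)` records are assumptions made for the example.

```python
def expand_by_tag(articles):
    """Turn (headline, [tag, ...]) records into one (headline, tag) pair per tag."""
    pairs = []
    for headline, tags in articles:
        for tag in tags:
            pairs.append((headline, tag))
    return pairs


# Toy input: a single headline carrying three of its SEO tags.
articles = [
    (
        "Bitcoin futures could open the floodgates for institutional investors",
        ["Bitcoin", "Commodity", "Futures"],
    ),
]
pairs = expand_by_tag(articles)
```

Each headline therefore appears once per tag, which is what allows every (sentence, tag) pair to be treated as a separate training instance.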

During data preparation, we fixed the sentence length at 28 words by truncating longer sentences and repeating words for shorter sentences. The Source Dataset was split into a train set and a test set.
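A possible implementation of the length normalization is sketched below. The text does not specify which words are repeated for short sentences, so this sketch assumes the sentence is cycled from its first word; the helper name `fix_length` and whitespace tokenization are likewise assumptions.

```python
from itertools import cycle, islice

SENTENCE_LENGTH = 28  # fixed sentence length stated in the text


def fix_length(tokens, length=SENTENCE_LENGTH):
    """Truncate long sentences; pad short ones by repeating their own words.

    Assumption: padding cycles through the sentence starting at its first word.
    """
    if not tokens:
        raise ValueError("cannot pad an empty sentence")
    # islice over an endless cycle covers both cases: it stops after `length`
    # tokens, which truncates long inputs and repeats short ones.
    return list(islice(cycle(tokens), length))


short_tokens = "Bitcoin futures could open the floodgates".split()
long_tokens = [f"w{i}" for i in range(40)]
fixed_short = fix_length(short_tokens)
fixed_long = fix_length(long_tokens)
```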

### 3.2 Test Datasets

The models were trained on the training set of the source dataset, and their text categorization accuracy was tested on two other open datasets: the UCI News Aggregator dataset and the tweet classification dataset.

#### UCI News Aggregator

The dataset contains more than 420,000 sentences belonging to four categories: technology, business, medicine, and entertainment. We report our accuracy on the entire dataset. Since the granularity of its categories differs from that of the SEO tags in the Source Dataset, we use a category tree for the UCI News Aggregator dataset (see the Testing section for details).

#### Tweet Classification

The dataset contains six categories: business, health, politics, sports, technology, and entertainment. It has 1993 sentences, and we used all of them for testing. Our model can classify the dataset directly using the embeddings of the category names (health, politics, etc.), but we get even better results using a category tree, as for the UCI News Aggregator dataset.

## 4 Architectures

Three different architectures were tried for zero-shot learning, and results are reported for each. They are referred to as Architecture 1, 2, and 3, respectively. We initialized the word embeddings with a pretrained embedding [1] for all three. For notation, let the tag’s embedding be $e_{tag}$ and the word embeddings of the sentence be $[w_1, \dots, w_n]$.

### 4.1 Architecture 1

We concatenate the dimension-wise mean of the word embeddings $[w_1, \dots, w_n]$ with the tag embedding $e_{tag}$ and pass it through a fully connected layer to classify whether the sentence and tag are related.
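The forward pass of Architecture 1 can be sketched in a few lines. This is an illustration under stated assumptions, not the trained model: embeddings are plain Python lists, the fully connected layer is assumed to have a single sigmoid output, and the names `mean_embedding` and `architecture1_score` are hypothetical.

```python
import math


def mean_embedding(word_embeddings):
    """Dimension-wise mean of the sentence's word embeddings."""
    n = len(word_embeddings)
    return [sum(dim) / n for dim in zip(*word_embeddings)]


def architecture1_score(word_embeddings, tag_embedding, weights, bias):
    """Relatedness probability from [mean word embedding ; tag embedding]."""
    features = mean_embedding(word_embeddings) + list(tag_embedding)
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # assumed sigmoid output unit


# Toy 2-dimensional embeddings; real inputs would be pretrained word vectors.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tag = [0.5, -0.5]
untrained_weights = [0.0, 0.0, 0.0, 0.0]
score = architecture1_score(words, tag, untrained_weights, 0.0)
```

With all-zero weights the layer is uninformative and the score is exactly 0.5; training moves the weights so that related pairs score above the decision threshold.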

### 4.2 Architecture 2

The input to the LSTM at time step $t$ is $[w_t]$, where $w_t$ is the embedding of the $t$-th word of the sentence. We concatenate the last hidden state of the network with the tag embedding $e_{tag}$ and pass it through a fully connected layer to classify whether the sentence and tag are related.
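The data flow of Architecture 2 is sketched below with a small pure-Python LSTM. The sketch makes several assumptions that the text does not state: a standard LSTM cell with input, forget, and output gates, random untrained weights, a single sigmoid output unit, and the hypothetical names `TinyLSTM` and `architecture2_score`. It shows only the forward pass, not training.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights, bias, x):
    """Matrix-vector product plus bias, on plain lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class TinyLSTM:
    """Standard LSTM cell with random weights, for illustration only."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        cols = input_dim + hidden_dim  # each gate acts on [x_t ; h_{t-1}]
        self.W = {
            g: [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(hidden_dim)]
            for g in "ifog"
        }
        self.b = {g: [0.0] * hidden_dim for g in "ifog"}

    def last_hidden(self, inputs):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in inputs:
            z = list(x) + h
            i = [sigmoid(v) for v in dense(self.W["i"], self.b["i"], z)]
            f = [sigmoid(v) for v in dense(self.W["f"], self.b["f"], z)]
            o = [sigmoid(v) for v in dense(self.W["o"], self.b["o"], z)]
            g = [math.tanh(v) for v in dense(self.W["g"], self.b["g"], z)]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h


def architecture2_score(lstm, fc_weights, fc_bias, word_embeddings, tag_embedding):
    """Feed word embeddings to the LSTM, then classify [h_last ; tag embedding]."""
    features = lstm.last_hidden(word_embeddings) + list(tag_embedding)
    logit = sum(w * v for w, v in zip(fc_weights, features)) + fc_bias
    return sigmoid(logit)


rng = random.Random(0)
embed_dim, hidden_dim = 4, 3
lstm = TinyLSTM(embed_dim, hidden_dim, rng)
fc_weights = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim + embed_dim)]
sentence = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(28)]
tag = [rng.uniform(-1, 1) for _ in range(embed_dim)]
score = architecture2_score(lstm, fc_weights, 0.0, sentence, tag)
```

Note that the tag embedding enters only after the sequence has been summarized, so the LSTM state itself does not depend on the tag.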

### 4.3 Architecture 3

The input to the LSTM at time step $t$ is $[w_t : e_{tag}]$, the concatenation of $w_t$, the embedding of the $t$-th word of the sentence, with the tag embedding $e_{tag}$. We use the last hidden state of the LSTM to predict whether the sentence is related to the tag.
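What distinguishes Architecture 3 is how its inputs are built: the tag embedding is appended to every word embedding before the sequence reaches the LSTM. The sketch below shows only that input construction, assuming plain Python lists as embeddings; the function name `architecture3_inputs` is hypothetical.

```python
def architecture3_inputs(word_embeddings, tag_embedding):
    """Build the per-time-step LSTM inputs: word embedding followed by tag embedding."""
    return [list(w_t) + list(tag_embedding) for w_t in word_embeddings]


# Toy 2-dimensional embeddings for a three-word sentence.
words = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
tag = [0.9, -0.9]
inputs = architecture3_inputs(words, tag)
```

Each time step now carries twice the embedding dimension, and the LSTM can condition its state on the tag from the first word onward.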

## 5 Training

Our dataset contained more than 300,000 tags, making multi-class training intractable. Therefore, we converted the tag prediction task into a binary classification task, in which the model predicts whether a given sentence is related to a given tag or not. In other words, we train our algorithm for the generic knowledge of whether a given sentence is related to a particular class. This is different from the knowledge “which class among a set of classes does a particular sentence belong to?” learned by a normal multi-class classifier. Figure ? tries to explain this visually. We trained each sentence with 50% related and 50% randomly selected unrelated tags. We trained the model with binary cross-entropy loss and the Adam optimizer [4].
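The construction of balanced training pairs and the loss can be sketched as follows. The sampling details are assumptions: one unrelated tag is drawn uniformly at random for each related tag, which yields the stated 50/50 split, and the names `make_training_pairs` and `binary_cross_entropy` are hypothetical. The optimizer step is omitted.

```python
import math
import random


def make_training_pairs(sentence_tags, all_tags, rng):
    """Emit (sentence, tag, label) triples: one random negative per positive."""
    examples = []
    for sentence, tags in sentence_tags.items():
        related = set(tags)
        # Assumes at least one tag in the vocabulary is unrelated to the sentence.
        unrelated = [t for t in all_tags if t not in related]
        for tag in tags:
            examples.append((sentence, tag, 1))
            examples.append((sentence, rng.choice(unrelated), 0))
    return examples


def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Mean binary cross-entropy over predicted relatedness probabilities."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


sentence_tags = {
    "Obama said that GOP's efforts to repeal health care were aggravating": [
        "politics",
        "healthcare",
    ],
}
all_tags = ["politics", "healthcare", "sports", "technology"]
examples = make_training_pairs(sentence_tags, all_tags, random.Random(0))
labels = [label for _, _, label in examples]
untrained_loss = binary_cross_entropy([0.5] * len(labels), labels)
```

A model that outputs 0.5 for every pair has a loss of ln 2, which is a useful baseline when monitoring training.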

## 6 Testing

We tested our models on the UCI News Aggregator and Tweet Classification datasets. There is a slight difference between the text classes used in these datasets and the SEO tags of our Source Dataset: SEO tags are more atomic concepts than the UCI classes. For example, while the SEO tags for the sentence “Bitcoin futures could open the floodgates for institutional investors” would be Bitcoin, Commodity, Futures, Cryptocurrency, Hedge Funds, and Mutual Funds, the labels in the UCI and Tweet Classification datasets would be classes representing broader concepts, namely Business and Technology.

Therefore, to test the accuracy of our classifier on these classes, we first create a category tree for each dataset, listing multiple tags that belong to each class. For example, tags like “forex”, “financial markets”, “stocks”, “production”, and “Business” itself might belong to the category tree under the “Business” class. To predict the relatedness of a given sentence to a particular class, we first predict the relatedness probability of every tag listed for that class and take their mean. The sentence is then classified into the classes whose mean relatedness probability for that sentence is above a certain threshold. The relatedness threshold is a hyperparameter. This technique allows our model to function across different levels of granularity of the concepts into which text can be classified. Making the category tree for either dataset takes just a few minutes of thinking about which concepts might belong to a particular class and listing them. For example, when testing on the Tweet Classification dataset, the category tree is just three tags per class.
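The category-tree procedure can be sketched as follows. The “Business” tags are the ones listed in the text; the “Technology” tags, the lookup table of scores, and the names `toy_relatedness` and `classify` are hypothetical stand-ins. In practice `relatedness` would be the trained model’s predicted probability for a (sentence, tag) pair.

```python
from statistics import mean

category_tree = {
    "Business": ["forex", "financial markets", "stocks", "production", "Business"],
    "Technology": ["software", "gadgets", "Technology"],  # hypothetical tags
}

# Hypothetical relatedness probabilities standing in for model outputs.
toy_scores = {
    "forex": 0.7, "financial markets": 0.9, "stocks": 0.8,
    "production": 0.4, "Business": 0.9,
    "software": 0.2, "gadgets": 0.1, "Technology": 0.3,
}


def toy_relatedness(sentence, tag):
    """Stand-in for the trained model; ignores the sentence."""
    return toy_scores[tag]


def classify(sentence, tree, relatedness, threshold=0.5):
    """Average tag-level relatedness per class and keep classes above threshold."""
    class_scores = {
        label: mean([relatedness(sentence, tag) for tag in tags])
        for label, tags in tree.items()
    }
    predicted = [label for label, s in class_scores.items() if s > threshold]
    return predicted, class_scores


predicted, class_scores = classify(
    "Stocks rally as markets shrug off forex worries",
    category_tree,
    toy_relatedness,
)
```

Because each class is thresholded independently, a sentence can be assigned to several classes or to none, which matches the multi-label framing of the task.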

## 7 Results

The models trained using Architectures 1, 2, and 3 achieved 72%, 72.6%, and 74% accuracy, respectively, on the test set of the Source Dataset for the binary classification task. For tags that are present only in the test set and not in the training set, the accuracy is even higher, at 78%, 76%, and 81%, respectively. Further, the models achieve 61.73%, 63%, and 64.21% accuracy, respectively, on the News Aggregator dataset using a category tree with a threshold of 0.5 on the relatedness score. The reported accuracy is much lower than the state-of-the-art accuracy (94.75%) [6] on this dataset. However, considering that our models had not seen a single sample from this dataset, unlike fully supervised methods, the reported results are still remarkable.

We evaluated the performance of our models on the tweet classification dataset using a category tree with a threshold of 0.5 on the relatedness score. Architectures 1, 2, and 3 achieved 64%, 53%, and 64.5% accuracy on the dataset, respectively. The best supervised models, such as SVC and multinomial naive Bayes, reach 74% and 78% accuracy, respectively; unlike them, our models are not trained on this dataset. If we do not use a category tree and classify with the class names directly, we can still get 49% accuracy using Architecture 3.

## 8 Conclusion

In this work, we introduce techniques and models that can be used for zero-shot classification of text. We show that our models achieve better-than-random classification accuracy on datasets without seeing even one example from them. This suggests that the technique learns a concept of relatedness between a sentence and a word that extends beyond individual datasets. That said, the accuracy levels leave a lot of scope for future work.

### References

1. Google Code Archive - Long-term storage for Google Code Project Hosting. Google News Embedding, 2013. URL https://code.google.com/archive/p/word2vec/.
2. Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation.
3. Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
4. Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 12 2014.
5. M. Lichman. UCI Machine Learning Repository, 2013. URL http://archive.ics.uci.edu/ml.
6. Luis Bronchal. Classifying with Logistic Regression (0.9473) | Kaggle, 2017. URL https://www.kaggle.com/lbronchal/classifying-with-logistic-regression-0-9473.
7. Parassharmaa. Tweet Classification, 2017. URL https://github.com/Parassharmaa/Tweet-Classifier.
8. Ubai Sandouk and Ke Chen. Multi-Label Zero-Shot Learning via Concept Embedding. 6 2016.
9. Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D. Manning, and Andrew Y. Ng. Zero-Shot Learning Through Cross-Modal Transfer. 1 2013.
10. Yequan Wang, Minlie Huang, Li Zhao, and Xiaoyan Zhu. Attention-based LSTM for Aspect-level Sentiment Classification. pp. 606–615, 2016.