This paper presents Aicyber’s system for NLPCC 2017 shared task 2. The system is a voting ensemble of three deep learning based systems trained on character-enhanced word vectors and a well-known bag-of-word model.
Keywords: word embedding, text classification, CNN, LSTM
Aicyber’s System for NLPCC 2017 Shared Task 2: Voting of Baselines
The NLPCC shared task 2 evaluates automatic classification techniques for very short texts, namely Chinese news headlines. Participants are challenged to identify the category of a given text among 18 classes. The training, development and test sets contain 156000, 36000 and 36000 headlines respectively. The classes in the training set are roughly balanced and are equally distributed in the development and test sets. The evaluation metrics are macro-averaged precision, recall and F1 score, as stated in the task guideline (http://tcci.ccf.org.cn/conference/2017/dldoc/taskgline02.pdf).
This paper describes the second-best system, submitted by team Aicyber, which achieved a classification accuracy of 0.825. A system overview is given first, and then each module is introduced in detail.
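The macro-averaged metrics can be illustrated with a minimal sketch. It assumes that macro F1 is the mean of the per-class F1 scores; the task guideline may instead derive it from the macro precision and recall, so this is an illustration rather than the official scorer. The function name `macro_prf` and the toy labels are hypothetical.

```python
from collections import Counter


def macro_prf(gold, pred):
    """Return macro-averaged (precision, recall, F1) over the gold labels.

    Assumption: macro F1 is the unweighted mean of per-class F1 scores.
    """
    labels = sorted(set(gold))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p wrongly
            fn[g] += 1  # missed the true class g

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with two hypothetical classes.
gold = ["sports", "sports", "finance", "finance"]
pred = ["sports", "finance", "finance", "finance"]
precision, recall, f1 = macro_prf(gold, pred)
```

Because every class contributes equally to the average, a system cannot score well by favoring only the frequent classes, which matters less here since the classes are roughly balanced.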
2 The Aicyber’s System
The submitted system is a voting of three official baseline systems (NBoW, CNN and LSTM) and a bag-of-word based SVM system.
Each baseline system’s prediction is itself a voting of that system trained on 5 different word vectors. The sub-systems’ architecture, experimental setup, training and development results are introduced in the following sections.
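The two-level voting described above can be sketched as follows. This is a minimal illustration assuming plain, unweighted majority voting with ties broken in favor of the earliest voter; the paper does not state the weighting or tie-breaking rule, and the function names `majority_vote` and `ensemble` are hypothetical. The toy example uses three runs per network for brevity, whereas the system uses five.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label sequences from several voters, one vote per voter.

    predictions: list of equal-length label lists, one list per voter.
    Assumption: ties go to the earliest voter whose label has the top count.
    """
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(next(label for label in labels if counts[label] == top))
    return voted


def ensemble(per_network_runs, svm_pred):
    """First vote within each network type, then across networks plus the SVM."""
    network_votes = [majority_vote(runs) for runs in per_network_runs.values()]
    return majority_vote(network_votes + [svm_pred])


# Toy example: two headlines, three runs (embeddings) per network.
runs = {
    "NBoW": [["a", "b"], ["a", "b"], ["a", "c"]],
    "CNN": [["a", "c"], ["a", "c"], ["b", "c"]],
    "LSTM": [["b", "c"], ["b", "b"], ["b", "b"]],
}
final = ensemble(runs, svm_pred=["a", "b"])
```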
2.1 The Official Baseline Systems
Three deep learning systems are implemented and released by the organizer as an open source project (https://github.com/FudanNLP/nlpcc2017_news_headline_categorization). They are the neural bag-of-words (NBoW) model [2, 3], the convolutional neural network (CNN) and the long short-term memory (LSTM) network. Hands-on instructions are given to guide participants in reproducing and enhancing the baseline systems. The reported accuracies are 0.783, 0.763 and 0.747 respectively.
The NBoW model takes the average of the word vectors in the input text and performs classification with a logistic regression layer. It is simple and computationally less expensive than the CNN and LSTM systems.
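The NBoW forward pass can be sketched in a few lines of plain Python. This is an illustrative sketch, not the organizers' implementation: the toy 2-dimensional embeddings, the weight matrix and all function names are made up, and skipping unknown tokens is an assumption.

```python
import math


def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nbow_predict(tokens, embeddings, weights, bias):
    """Class probabilities from averaged embeddings and a linear layer."""
    dim = len(weights[0])
    avg = average_vector(tokens, embeddings, dim)
    logits = [
        sum(w * x for w, x in zip(row, avg)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical toy parameters: 2-dimensional vectors, 2 classes.
toy_embeddings = {
    "stock": [1.0, 0.0],
    "market": [0.8, 0.2],
    "match": [0.0, 1.0],
}
toy_weights = [[2.0, -1.0], [-1.0, 2.0]]  # one row per class
toy_bias = [0.0, 0.0]
probs = nbow_predict(["stock", "market"], toy_embeddings, toy_weights, toy_bias)
```

Because the tokens are averaged, reordering the input leaves the prediction unchanged.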
Unlike the NBoW model, which does not take word order into account, the CNN and LSTM (and RNN in general) models capture rich compositional information and have achieved impressive performance on multiple benchmarks [4, 6]. It has been suggested that the CNN model need not be complex to realize strong results, as a simple one-layer CNN can achieve state-of-the-art results across several datasets. The LSTM model has achieved remarkable performance in different sequence learning problems in speech, image and text analysis [8, 9]. It is useful for capturing long-range dependencies in sequences.
The three systems share a similar pipeline for text classification: word or character tokens are taken as input and passed through a word embedding layer, followed by an averaging operation (NBoW), a CNN layer or an LSTM layer, and finally a softmax layer.
The following sections focus on the pre-training of the word embedding layer.
2.1.1 Word Embeddings
The word embedding layer, commonly associated with word2vec, is randomly initialized by default; for this evaluation, pre-trained character-level and word-level embeddings are provided. However, we prefer two types of embeddings that outperformed the standard approach, as verified in a dimensional sentiment analysis task (https://github.com/StevenLOL/ialp2016_Shared_Task).
2.1.1.1 Character-enhanced Word Embedding
The first set of word embeddings is the character-enhanced word embedding (CWE). The CWE study shows that the semantic meaning of a word is related to its composing characters. Two types of CWE embeddings are used: position-based character embeddings (CWE+P) and cluster-based character embeddings (CWE+L).
They are trained with the Skip-Gram model using window sizes of 5 and 11, 5 iterations, 5 negative examples, a minimum word count of 5 and a starting learning rate of 0.025. The learned word vectors have 300 dimensions.
2.1.1.2 FastText Embedding
The second set of word embeddings is FastText (https://github.com/facebookresearch/fastText), whose idea is to enrich word vectors with sub-word information. For example, in English a word vector is associated with the character n-grams of the word.
The FastText word embeddings are trained with settings similar to those used for CWE. Note that the default minimum character n-gram length is 1 for Chinese.
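To make the sub-word idea concrete, the sketch below enumerates character n-grams of a word wrapped in boundary markers. It is a simplified illustration rather than the FastText implementation: the function name `char_ngrams`, the `<` and `>` markers and the choice to skip lone boundary markers are assumptions made for this example.

```python
def char_ngrams(word, min_n=1, max_n=3):
    """List the character n-grams of a word wrapped in boundary markers.

    Assumption: n-grams that consist only of a boundary marker are skipped.
    """
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            gram = wrapped[start:start + n]
            if gram in ("<", ">"):
                continue
            grams.append(gram)
    return grams


english = char_ngrams("where", min_n=3, max_n=3)
chinese = char_ngrams("高兴", min_n=1, max_n=2)
```

With a minimum length of 1, every single character of a Chinese word becomes an n-gram in its own right, so characters contribute directly to the word vector.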
2.1.1.3 Data Usage for Embedding Training
The following publicly available datasets are used for unsupervised learning of the word embeddings:
2.1.1.4 Training and Evaluation
The data above is preprocessed with jieba (https://github.com/fxsjy/jieba); after filtering, 555571 unique tokens are left. Embedding training produces six sets of word vectors: CWE-L-W5, CWE-L-W11, CWE-P-W5, CWE-P-W11, FastText-W5 and FastText-W11 (W denotes window size).
To verify the correctness of the word embeddings, we examine the nearest neighbors of a given Chinese word, e.g. 高兴 (happy). The result from FastText-W11 is clearly different from the others: 5 single-character words appear in its top 10, against 0 for the other embeddings. This indicates that FastText does not work properly for Chinese with a large window size, in which case the character n-grams, especially unigrams, are overestimated. FastText-W11 is therefore dropped.
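The sanity check above can be sketched as a cosine-similarity nearest-neighbor query followed by a count of single-character neighbors. The sketch assumes cosine similarity as the distance measure, which the paper does not state; the 2-dimensional toy vectors and the neighbor words are invented for illustration, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbors(word, embeddings, k=10):
    """Return the k most similar words to `word`, best first."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def single_char_count(neighbors):
    """Count neighbors that are single-character words."""
    return sum(1 for neighbor, _ in neighbors if len(neighbor) == 1)


# Invented toy vectors; real embeddings have 300 dimensions.
toy = {
    "高兴": [1.0, 0.1],
    "开心": [0.9, 0.2],
    "快乐": [0.8, 0.3],
    "兴": [0.7, 0.7],
    "股票": [0.0, 1.0],
}
top3 = nearest_neighbors("高兴", toy, k=3)
n_single = single_char_count(top3)
```

A high count of single-character neighbors for a multi-character query is the symptom that led to FastText-W11 being dropped.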
2.1.2 Training of Official Baseline systems
With the 5 remaining embeddings, 15 (5 × 3) systems are formed. We use the default settings for the NBoW and LSTM systems. For the CNN system, only one convolution layer is used (filter size 3).
Each system is trained only on the 156000 training headlines and evaluated on the 36000 development headlines. Accuracy is used as the performance metric.
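The 15 configurations follow from crossing the three network types with the five retained embeddings (FastText-W11 having been dropped). A minimal sketch of that grid, with variable names chosen for illustration only:

```python
from itertools import product

# Embedding names as listed in the text, without the dropped FastText-W11.
EMBEDDINGS = ["CWE-L-W5", "CWE-L-W11", "CWE-P-W5", "CWE-P-W11", "FastText-W5"]
NETWORKS = ["NBoW", "CNN", "LSTM"]

# Each (network, embedding) pair is one trained system.
systems = list(product(NETWORKS, EMBEDDINGS))

# Group the systems per network, i.e. the sets that vote together first.
per_network = {
    network: [emb for net, emb in systems if net == network]
    for network in NETWORKS
}
```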
2.1.3 Result and Discussion
The results of the official baselines are presented in Table 1. To make a fair comparison, systems trained on randomized character/word vectors (of length 300) are also included.
Table 1: Official baseline systems
|Network Type|Embeddings|Development Accuracy|
Systems trained with pre-trained embeddings are clearly much better than those with randomized embeddings. Systems with word embeddings give better results than those using character embeddings. CNN is the most accurate system. Among the embedding types, FastText underperforms the others, while the difference between CWE-P and CWE-L is negligible.
2.2 Bag-of-Word model
The officially released systems are relatively strong, but we also seek alternatives to tackle the classification problem, starting with a well-known baseline, the bag-of-word model. It is commonly used in text classification, where the occurrence of each word is used as a feature for training classifiers. The Support Vector Machine (SVM) with a linear kernel is considered to be one of the best classifiers [15, 16].
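The feature extraction step of a bag-of-word model can be sketched as below. The sketch assumes raw occurrence counts over a vocabulary built from the training data; whether the submitted system used counts, binary indicators or a weighting such as TF-IDF is not stated, and the function names and toy tokens are hypothetical. Training the linear SVM on these vectors is not shown.

```python
def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_vector(tokens, vocab):
    """Count occurrences of each vocabulary token; unseen tokens are ignored."""
    vector = [0] * len(vocab)
    for token in tokens:
        index = vocab.get(token)
        if index is not None:
            vector[index] += 1
    return vector


# Toy, already-segmented headlines (English tokens for readability).
train_docs = [["stock", "market", "rises"], ["team", "wins", "match"]]
vocab = build_vocabulary(train_docs)
features = bow_vector(["market", "market", "match", "unseen"], vocab)
```

Each headline becomes a sparse vector with one dimension per vocabulary word, which is the input a linear classifier would separate by class.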
This system is trained on the 156000 training headlines and validated on the 36000 development headlines. Table 2 shows that the BoW model obtains 0.791 classification accuracy. This result is better than that of every deep learning system with randomized embeddings, a finding that demonstrates the importance of pre-trained word vectors.
Table 2: BoW vs. deep learning systems with randomized word vectors
|Randomized Word Vector|LSTM|0.728|
|Randomized Word Vector|CNN|0.763|
|Randomized Word Vector|NBoW|0.779|
To summarize, the submitted system is an ensemble of three deep learning based systems and a conventional BoW model; it is truly a voting of baselines. The final classification accuracy is 0.826, measured on the development set.
3 Discussion and Further Improvement
Comparing the BoW method in Table 2 with the best single system in Table 1, the difference is only 0.03, so we consider BoW to be well suited to Chinese headline classification. Headlines tend to be clear, concise and powerful, and their words are carefully selected by professional editors. The importance of individual words makes BoW work well for this task.
The best single system achieved 0.824 classification accuracy on the development set, while the voting system scored 0.826 on the same dataset. The voting method thus provides only a marginal improvement in this work.
The word embedding training in Section 2.1.1 is a kind of unsupervised pre-training, but it is limited to the embedding layer. Previous work on semi-supervised sequence learning shows that recurrent language models and sequence autoencoders can be used to pre-train not only the embedding layer but also the LSTM layer. On the five benchmarks tried in that work, pre-trained LSTMs reach or surpass the performance of all previous baselines. As shown in Table 1, LSTM did not beat the CNN or NBoW model, so such pre-training could boost the performance of LSTM.
In this paper we presented our approach to the Chinese news headline categorization challenge. A voting system consisting of three deep learning based systems built on five different embedding layers and a BoW model ranked 2nd among 32 teams.
- [1] Xipeng Qiu, Jingjing Gong and Xuanjing Huang. Overview of the NLPCC 2017 Shared Task: Chinese News Headline Categorization. arXiv:1706.02883v1 (2017)
- [2] Nal Kalchbrenner, Edward Grefenstette and Phil Blunsom. A Convolutional Neural Network for Modelling Sentences. Eprint Arxiv, 1 (2014)
- [3] Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber and Hal Daumé III. Deep Unordered Composition Rivals Syntactic Methods for Text Classification. Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, 1681-1691 (2015)
- [4] Yoon Kim. Convolutional Neural Networks for Sentence Classification. Eprint Arxiv (2014)
- [5] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8), 1735 (1997)
- [6] Sheikh I, Illina I, Fohr D, et al. Learning Word Importance with the Neural Bag-of-Words Model. The Workshop on Representation Learning for NLP (2016)
- [7] Ye Zhang and Byron Wallace. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. Computer Science (2015)
- [8] Shalini Ghosh, Oriol Vinyals, Brian Strope, Scott Roy, Tom Dean and Larry Heck. Contextual LSTM (CLSTM) models for Large scale NLP tasks (2016)
- [9] Ji Young Lee and Franck Dernoncourt. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. NAACL, 515-520 (2016)
- [10] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26:3111–3119 (2013)
- [11] Steven Du and Zhang Xi. Aicyber’s system for IALP 2016 shared task: Character-enhanced word vectors and Boosted Neural Networks. International Conference on Asian Language Processing. IEEE (2017)
- [12] Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun and Huanbo Luan. Joint learning of character and word embeddings. In International Conference on Artificial Intelligence (2015)
- [13] Bojanowski P, Grave E, Joulin A, et al. Enriching Word Vectors with Subword Information (2016)
- [14] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 988-999 (1995)
- [15] George Forman. An extensive empirical study of feature selection metrics for text classification. JMLR.org, 1289-1305 (2003)
- [16] Yiming Yang and Xin Liu. A re-examination of text categorization methods. International ACM SIGIR Conference on Research and Development in Information Retrieval, 42-49 (1999)
- [17] Bottou L. Large-Scale Machine Learning with Stochastic Gradient Descent, 177-186 (2010)
- [18] Dai A M, Le Q V. Semi-supervised Sequence Learning, 3079-3087 (2015)
- [19] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae and Shin Ishii. Distributional smoothing with virtual adversarial training. In ICLR (2016)
- [20] Miyato T, Dai A M, Goodfellow I. Adversarial Training Methods for Semi-Supervised Text Classification (2017)
- [21] Ian J. Goodfellow, Jonathon Shlens and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representation (2015)