SuperChat: Dialogue Generation by Transfer Learning from Vision to Language using Two-dimensional Word Embedding and Pretrained ImageNet CNN Models
The recent Super Characters method, which uses two-dimensional word embedding, achieved state-of-the-art results in text classification tasks, showcasing the promise of this new approach. This paper borrows the idea of the Super Characters method and two-dimensional embedding, and proposes a method for generating conversational responses in open-domain dialogues. Experimental results on a public dataset show that the proposed SuperChat method generates high-quality responses. An interactive demo is ready to be shown at the workshop.
Dialogue systems are important for enabling machines to communicate with humans through natural language. Given an input sentence, the dialogue system outputs a response sentence that reads naturally, as if a human were talking. Previous work adopts an encoder-decoder architecture, as well as improved architectures with an added attention scheme [1, 11, 12]. In architectures with attention, the input sentence is first encoded into vectors, and the encoded vectors are then weighted by the attention scores to obtain the context vector. The concatenation of the context vector and the previous output vector of the decoder is fed into the decoder to predict the next words iteratively. Generally, the encoded vectors, the context vector, and the decoder output vector are all one-dimensional embeddings, i.e. arrays of real-valued numbers. The encoder and decoder usually adopt RNN models, such as bidirectional GRU [1, 4], LSTM, and bidirectional LSTM. However, the encoding part is computationally expensive.
The recent Super Characters method has obtained state-of-the-art results for text classification on benchmark datasets in different languages, including English, Chinese, Japanese, and Korean. The Super Characters method is a two-step method. In the first step, the characters of the input text are drawn onto a blank image. Each character is represented by a two-dimensional embedding, i.e. a matrix of real-valued numbers. The resulting image is called a Super Characters image. In the second step, Super Characters images are fed into two-dimensional CNN models for classification. Examples of such two-dimensional CNN models are those used in Computer Vision (CV) tasks on ImageNet, such as VGG, ResNet, and SE-net.
In this paper, we propose the SuperChat method for dialogue generation using two-dimensional embedding. It has no encoding phase, only a decoding phase. The decoder is fine-tuned from two-dimensional CNN models pretrained for the ImageNet competition. In each decoding iteration, the image of the text, produced by two-dimensional embedding of both the input sentence and the partial response sentence, is fed directly into the decoder, without being compressed into a concatenated vector as in previous work.
2 The Proposed SuperChat Method
The proposed SuperChat method is motivated by the two-dimensional embedding used in the Super Characters method. If the Super Characters method could keep the same good performance when the number of classes in the text classification problem becomes even larger, e.g. the size of dialogue vocabulary, then the Super Characters method should be able to address the task of conversational dialogue generation. This can be done by treating the input sentence and the partial response sentence as one combined text input.
Figure 1 illustrates the proposed SuperChat method. The response sentence is predicted sequentially, one response word per iteration. During each iteration, the input sentence and the current partial response sentence are embedded into an image through two-dimensional embedding. The resulting image is called a SuperChat image. This SuperChat image is then fed into a CNN model to predict the next response word. In each SuperChat image, the upper portion corresponds to the input sentence, and the lower portion corresponds to the partial response sentence. At the beginning of the iteration, the partial response sentence is initialized as null. The first response word is predicted from the SuperChat image with only the input sentence embedded, and the predicted word is then appended to the current partial response sentence. The iteration continues until End Of Sentence (EOS) is predicted. The final output is the concatenation of the sequentially predicted words.
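The iterative decoding described above can be sketched as a simple greedy loop. This is a minimal illustration, not the authors' implementation: the CNN (together with the rendering of the SuperChat image) is abstracted behind a hypothetical `predict_next` callable, and the `EOS` token name, `max_len` default, and `toy_predictor` stand-in are assumptions made for the example.

```python
from typing import Callable

EOS = "<EOS>"  # hypothetical name for the end-of-sentence class


def generate_response(
    input_sentence: str,
    predict_next: Callable[[str, str], str],
    max_len: int = 18,
) -> str:
    """Greedy decoding: grow the partial response one token at a time."""
    partial = ""  # the partial response starts out empty
    while len(partial) < max_len:
        # In the paper's pipeline, (input_sentence, partial) would be drawn
        # into one SuperChat image and classified by the CNN. Here that whole
        # step is abstracted as predict_next.
        token = predict_next(input_sentence, partial)
        if token == EOS:
            break
        partial += token
    return partial


def toy_predictor(input_sentence: str, partial: str) -> str:
    """Toy stand-in for the CNN: replays a canned reply, then emits EOS."""
    reply = "I like you too"
    return reply[len(partial)] if len(partial) < len(reply) else EOS


print(generate_response("The first time I saw you, I liked you", toy_predictor))
```

The `max_len` guard is an added safety stop for the sketch, so that a model that never predicts EOS cannot loop forever.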
The CNN model used in this method is fine-tuned from pretrained ImageNet models to predict the next response word, with the generated SuperChat image as input. It can be trained end-to-end using a large dialogue corpus. Thus the problem of predicting the next response word in dialogue generation is converted into an image classification problem.
The training data is generated by labeling each SuperChat image as an example of the class indicated by its next response word. A SuperChat image is labeled EOS if its response sentence is complete.
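As a rough sketch of this labeling scheme, one dialogue pair can be expanded into a list of (input, partial response, label) triples, each of which would then be rendered as one SuperChat image. The function name, the character-level tokens, and the `EOS` token name are assumptions made for illustration.

```python
from typing import List, Tuple

EOS = "<EOS>"  # hypothetical name for the end-of-sentence class


def make_training_samples(
    input_sentence: str, response: str
) -> List[Tuple[str, str, str]]:
    """Expand one dialogue pair into (input, partial_response, label) triples."""
    samples = []
    for i in range(len(response)):
        # The image shows the input plus the first i response characters;
        # the label is the character that comes next.
        samples.append((input_sentence, response[:i], response[i]))
    # The image showing the complete response is labeled EOS.
    samples.append((input_sentence, response, EOS))
    return samples


for sample in make_training_samples("你好吗", "我很好"):
    print(sample)
```

Under this scheme, a response of n characters yields n + 1 labeled images, which is why the number of generated images is several times the number of dialogue pairs.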
The cut-length of sentences is closely related to the font size of each character. For a fixed image size, a larger cut-length means a smaller font size for each character, and vice versa. On the one hand, we want to cover long sentences, which calls for a large cut-length, so that there is variety in both the input dialogue and the response dialogue. On the other hand, if the cut-length is too large, the font size of each character will be small, and short sentences will leave a large blank area, which wastes space on the image. The cut-length should therefore be configured according to the sentence length distribution.
It should also be emphasized that the image need not be split evenly between the input and response parts. Depending on the statistics of the training data, a larger or smaller area could be assigned to the response or the input text. Also, the font size does not need to be the same for each part.
Although the examples in Figure 1 are illustrated with Chinese sentences, the method can also be applied to other languages, for example Asian languages such as Japanese and Korean, which have the same square-shaped characters as Chinese. For Latin languages, where words vary in length, the SEW method could be used to convert the words into a square shape before applying the SuperChat method to generate the dialogue response.
Beam search could also be used. In that case, instead of a hard prediction for the first character, a soft prediction is used to produce the possible candidate sentences, and one of the best is selected as the final output.
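To make the idea concrete, here is a minimal sketch of standard beam search over next-token probabilities. It is a generic textbook variant, not a procedure specified in this paper: the hypothetical `next_probs` callable stands in for the CNN's softmax output, and `toy_probs`, the `EOS` name, and the default `beam_width` are assumptions for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<EOS>"  # hypothetical name for the end-of-sentence class


def beam_search(
    input_sentence: str,
    next_probs: Callable[[str, str], Dict[str, float]],
    beam_width: int = 3,
    max_len: int = 18,
) -> str:
    """Keep the beam_width most probable partial responses at every step."""
    beams: List[Tuple[float, str]] = [(0.0, "")]  # (log-probability, partial)
    finished: List[Tuple[float, str]] = []
    for _ in range(max_len):
        candidates: List[Tuple[float, str]] = []
        for score, partial in beams:
            for token, p in next_probs(input_sentence, partial).items():
                if p <= 0.0:
                    continue
                new_score = score + math.log(p)
                if token == EOS:
                    finished.append((new_score, partial))
                else:
                    candidates.append((new_score, partial + token))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[0])[1]


def toy_probs(input_sentence: str, partial: str) -> Dict[str, float]:
    """Toy distribution in which the greedy first choice is not the best."""
    table = {
        "": {"a": 0.6, "b": 0.4},
        "a": {"x": 0.5, "y": 0.5},
        "b": {"z": 1.0},
    }
    return table.get(partial, {EOS: 1.0})


print(beam_search("hi", toy_probs, beam_width=2))  # keeps "b" alive, finds "bz"
print(beam_search("hi", toy_probs, beam_width=1))  # same as greedy decoding
```

With `beam_width=1` the search reduces to the greedy, hard-prediction decoding; a wider beam lets a less likely first character win when the sentence it leads to is more probable overall.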
The dataset used is Simsimi.
Based on the distribution of sentence lengths, we set the cut-length to 18 for the input sentence and also 18 for the response. Altogether, one SuperChat image therefore holds 36 characters, which can be laid out as 6 rows by 6 columns of characters. The input sentence takes the upper 3 rows, and the response sentence takes the lower 3 rows. For simplicity, we removed all emoticons from the dataset. To get enough samples for training, only characters that appear at least 1000 times are included in the list of characters to predict. After this filtering, the remaining set consists of the sentence pairs whose input and response sentences are both shorter than 18 characters and whose characters all belong to the list of 528 frequent characters (including EOS). The resulting set contains 178,192 dialogue pairs, from which a total of 989,087 SuperChat images are generated.
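The filtering step could look roughly like the sketch below. It assumes that character frequencies are counted over both input and response sentences and that a pair is kept only when every one of its characters is frequent; the text does not spell out these details, so treat them, along with the function name, as assumptions.

```python
from collections import Counter
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (input sentence, response sentence)


def filter_pairs(
    pairs: List[Pair], min_freq: int = 1000, cut_length: int = 18
) -> Tuple[Set[str], List[Pair]]:
    """Build the frequent-character list, then keep only pairs that fit it."""
    # Assumption: frequencies are counted over inputs and responses together.
    counts = Counter(ch for inp, resp in pairs for ch in inp + resp)
    vocab = {ch for ch, n in counts.items() if n >= min_freq}
    kept = [
        (inp, resp)
        for inp, resp in pairs
        if len(inp) < cut_length
        and len(resp) < cut_length
        and all(ch in vocab for ch in inp + resp)
    ]
    return vocab, kept


# Tiny demonstration with a lowered frequency threshold.
demo = [("ab", "ba"), ("ab", "c"), ("a" * 20, "b")]
vocab, kept = filter_pairs(demo, min_freq=2)
print(sorted(vocab), kept)
```

In the demonstration, the second pair is dropped because "c" is too rare and the third because its input exceeds the cut-length.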
We set the image size to 224x224 with three channels of grey image, in order to use the models pretrained on ImageNet. We also added a margin along the four edges of the SuperChat image, which means the first character does not start at pixel location (0, 0), but at (m, m) instead, where m is the margin set for the four edges. The value of m set in this experiment leaves a remaining square area of (224 - 2m) x (224 - 2m) pixels. If we set the same font size for both the input and response sentences, the 6 by 6 layout gives a font size of (224 - 2m)/6 pixels, so each character takes an area of (224 - 2m)/6 x (224 - 2m)/6 pixels. The font used is “simhei”.
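The layout geometry can be illustrated with a small helper that computes the pixel box of every character slot. This is only a sketch: the margin value is not recoverable from the text above, so the default of 16 pixels is purely an illustrative assumption, as are the function names. Actual rendering (for example drawing glyphs in the simhei font) is left out.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def char_boxes(
    image_size: int = 224, margin: int = 16, rows: int = 6, cols: int = 6
) -> List[Box]:
    """Pixel boxes of the character slots, row by row from the top left.

    margin=16 is an illustrative assumption, not the paper's value.
    """
    cell = (image_size - 2 * margin) // cols  # side of one square slot
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, top = margin + c * cell, margin + r * cell
            boxes.append((left, top, left + cell, top + cell))
    return boxes


def place_text(
    input_sentence: str,
    partial_response: str,
    boxes: List[Box],
    input_slots: int = 18,
) -> List[Tuple[str, Box]]:
    """Assign input characters to the upper slots, response to the lower ones."""
    placed = [(ch, boxes[i]) for i, ch in enumerate(input_sentence[:input_slots])]
    response_slots = len(boxes) - input_slots
    for j, ch in enumerate(partial_response[:response_slots]):
        placed.append((ch, boxes[input_slots + j]))
    return placed


boxes = char_boxes()
for ch, box in place_text("你好吗", "我", boxes):
    print(ch, box)
```

With the illustrative margin of 16, each slot is 32 x 32 pixels, and the response region starts at the fourth row of slots.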
3.1 Model Training
For each character, we split its labeled data into 75% for training and 25% for testing, resulting in 739,289 training samples and 249,798 testing samples.
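A per-class split of this kind can be sketched as follows. The shuffling, the fixed seed, and the rounding down of the training share are assumptions for the example; the text only states the 75%/25% proportions per character.

```python
import random
from collections import defaultdict
from typing import Any, List, Tuple

Sample = Tuple[Any, str]  # (image or image id, label character)


def split_per_class(
    samples: List[Sample], train_fraction: float = 0.75, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Split each label's samples separately into training and testing sets."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_train = int(len(items) * train_fraction)
        train += [(item, label) for item in items[:n_train]]
        test += [(item, label) for item in items[n_train:]]
    return train, test


data = [(i, "a") for i in range(8)] + [(i, "b") for i in range(4)]
train, test = split_per_class(data)
print(len(train), len(test))
```

Splitting per class rather than over the whole pool keeps every character represented in both sets in roughly the same proportion.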
SE-net-154 is used in this experiment. The model pretrained on ImageNet is used as initialization.
At the beginning of training, the curve climbs quickly; after reaching 60%, it flattens and nearly saturates at 64%. The x-axis is in units of 500 iterations, so training runs for one million iterations in total; with a batch size of 5, this amounts to about 7 epochs over the training data.
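The epoch estimate follows directly from the figures quoted in this section, as the short calculation below shows. The variable names are chosen for the example only.

```python
iterations = 1_000_000      # total training iterations
batch_size = 5              # samples per iteration
training_samples = 739_289  # size of the training split
iterations_per_tick = 500   # one x-axis unit

samples_seen = iterations * batch_size
epochs = samples_seen / training_samples
x_axis_ticks = iterations // iterations_per_tick

print(f"samples seen: {samples_seen}")
print(f"epochs: {epochs:.2f}")
print(f"x-axis ticks: {x_axis_ticks}")
```

Five million samples seen over a training set of 739,289 samples comes to a little under 7 passes, consistent with the "about 7 epochs" stated above.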
3.2 Sample Response Sentences
Table 1 shows sample response sentences output by the SuperChat method. The responses follow the grammar rules, and their style is funny and cute, as learnt from the training data.
|Sample ID|Input Sentence|Response Sentence|
|---|---|---|
|1|“Do you have boyfriend?”|“No, I only have a husband”|
|2|“Hmm! I won’t play with you if you didn’t miss me”|“I am funny, please play with me, please”|
|3|“The first time I saw you, I liked you”|“I like you too”|
|4|“What does that mean”|“Go Baidu it yourself”|
|5|“I am going to sleep now”|“Good dream, wake you up tomorrow”|
|6|“When will you sleep”|“I will wait until after you sleep”|
|7|“I am bored”|“Let’s chitchat”|
|8|“I am sad, and cried a lot”|“Don’t cry, not looking good when crying”|
|9|“Where are you from”|“I am from Mars”|
|10|“BTW what do you eat”|“I eat whatever you like to eat”|
|11|“You won’t know my heartbroken for you”|“I won’t lie to you. Not now, neither future.”|
|12|“So you are a beauty”|“I am a beauty!”|
|13|“Are you asleep”|“Nope, I am waiting for you.”|
|14|“Can you do rooster crowing”|“Cockadoodledoo!”|
|15|“Are you tired”|“You tired? Then go sleep. Honey”|
|16|“I miss her, what should I do”|“Go see her!”|
|17|“You seem to know everything”|“I don’t know, but it happens I know something”|
|18|“I sent message to you. But you didn’t reply”|“I apologize”|
|19|“Dislike you”|“I dislike anyone who dislikes me”|
|20|“Guess what is my constellation”|“Gemini”|
In this paper, we propose the SuperChat method for dialogue response generation. It has no encoding phase; it only decodes the two-dimensional embedding of the input sentence and the partial response sentence to predict the next response word iteratively. The pretrained two-dimensional CNN model is fine-tuned with the generated SuperChat images. The experimental results show high-quality responses. An interactive demonstration will be shown at the workshop.
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
[3] Markus Freitag and Yaser Al-Onaizan. Beam search strategies for neural machine translation. arXiv preprint arXiv:1702.01806, 2017.
[4] Jun Gao, Wei Bi, Xiaojiang Liu, Junhui Li, and Shuming Shi. Generating multiple diverse responses for short-text conversation. arXiv preprint arXiv:1811.05696, 2018.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[6] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
[7] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[8] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[9] Baohua Sun, Lin Yang, Catherine Chi, Wenhan Zhang, and Michael Lin. Squared English Word: A method of generating glyph to use Super Characters for sentiment analysis. arXiv preprint arXiv:1902.02160, 2019.
[10] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
[12] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.