Backward and Forward Language Modeling for Constrained Sentence Generation
Recent language models, especially those based on recurrent neural networks (RNNs), make it possible to generate natural language from a learned probability. Language generation has wide applications including machine translation, summarization, question answering, conversation systems, etc. Existing methods typically learn a joint probability of words conditioned on additional information, which is (either statically or dynamically) fed to RNN’s hidden layer. In many applications, we are likely to impose hard constraints on the generated texts, i.e., a particular word must appear in the sentence. Unfortunately, existing approaches could not solve this problem. In this paper, we propose a novel backward and forward language model. Provided a specific word, we use RNNs to generate previous words and future words, either simultaneously or asynchronously, resulting in two model variants. In this way, the given word could appear at any position in the sentence. Experimental results show that the generated texts are comparable to sequential LMs in quality.
Language modeling is aimed at maximizing the joint probability of a corpus. It has long been the core of natural language processing (NLP), and has inspired a variety of other models, e.g., the $n$-gram model, smoothing techniques [4], as well as various neural networks for NLP [2, 6, 17]. In particular, the renewed prosperity of neural models has made groundbreaking improvement in many tasks, including language modeling per se, part-of-speech tagging, named entity recognition, semantic role labeling, etc.
The recurrent neural network (RNN) is a prevailing class of language models; it is suitable for modeling time-series data (e.g., a sequence of words) by its iterative nature. An RNN usually keeps one or a few hidden layers; at each time slot, it reads a word, and changes its state accordingly. Compared with traditional $n$-gram models, RNNs are more capable of learning long range features, especially with long short term memory (LSTM) units [7] or gated recurrent units (GRU) [5], and hence are better at capturing the nature of sentences. On such a basis, it is even possible to generate a sentence from an RNN language model, which has wide applications in NLP, including machine translation, abstractive summarization [13], question answering [19], and conversation systems [18]. The sentence generation process is typically accomplished by choosing the most likely word at a time, conditioned on previous words as well as additional information depending on the task (e.g., the vector representation of the source sentence in a machine translation system).
In many scenarios, however, we are likely to impose constraints on the generated sentences. For example, a question answering system may involve analyzing the question and querying an existing knowledge base, at which point a candidate answer is at hand. A natural language generator is then supposed to generate a sentence, coherent in semantics, containing the candidate answer. Unfortunately, using existing language models to generate a sentence with a given word is non-trivial: adding additional information [16, 19] about a word does not guarantee that the wanted word will appear; generic probabilistic samplers (e.g., Markov chain Monte Carlo methods) hardly apply to RNN language models (with recent efforts in [3]); setting an arbitrary word to be the wanted word damages the fluency of a sentence; imposing the constraint on the first word restricts the form of generated sentences.
In this paper, we propose a novel backward and forward (B/F) language model to tackle the problem of constrained natural language generation. Rather than generate a sentence from the first word to the last in sequence as in traditional models, we use RNNs to generate previous and subsequent words conditioned on the given word. The forward and backward generation can be accomplished either simultaneously or asynchronously, resulting in two variants, syn-B/F and asyn-B/F. In this way, our model is complete in theory for generating a sentence with a wanted word, which can appear at an arbitrary position in the sentence.
2 Background and Related Work
2.1 Language Modeling
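Given a corpus $\mathbf{w} = w_1, \cdots, w_m$, language modeling aims to maximize the joint distribution of $\mathbf{w}$, i.e., $p(\mathbf{w})$. Inspired by the observation that people always say a sentence from the beginning to the end, we would like to decompose the joint probability as
$$p(\mathbf{w}) = \prod_{t=1}^{m} p(w_t \mid w_1^{t-1}) \tag{1}$$
where $w_i, w_{i+1}, \cdots, w_j$ is denoted as $w_i^j$ for short.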
Parameterizing by multinomial distributions, we need to further simplify the above equation in order to estimate the parameters. Imposing a Markov assumption (a word is only dependent on the previous $n-1$ words and independent of its position) results in the classic $n$-gram model, where the joint probability is given by
$$p(\mathbf{w}) \approx \prod_{t=1}^{m} p(w_t \mid w_{t-n+1}^{t-1}) \tag{2}$$
To mitigate the data sparsity problem, a variety of smoothing methods have been proposed. We refer interested readers to textbooks like [8] for $n$-gram models and their variants.
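As a concrete illustration of the $n$-gram idea with smoothing, the following is a minimal sketch of a bigram model ($n = 2$) with add-one (Laplace) smoothing. The toy corpus, the `<s>`/`</s>` boundary markers, and the function names are assumptions made for illustration only; the paper does not prescribe a particular smoothing method.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def bigram_prob(prev, cur, context_counts, bigram_counts, vocab_size):
    """Add-one smoothed estimate of p(cur | prev)."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)


# Toy corpus (hypothetical data).
corpus = [["neural", "networks"], ["neural", "machine", "translation"]]
# Words that can be predicted: all corpus words plus the end marker.
vocab = {w for s in corpus for w in s} | {"</s>"}
ctx, bi = train_bigram(corpus)
p = bigram_prob("neural", "networks", ctx, bi, len(vocab))
print(p)
```

Because one pseudo-count is added for every word in the vocabulary, the smoothed conditional distribution still sums to one, and unseen bigrams receive a small non-zero probability.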
Bengio et al. [2] propose to use feed-forward neural networks to estimate the probability in Equation 2. In their model, a word is first mapped to a low-dimensional vector, known as an embedding; then a feed-forward neural network propagates information to a softmax output layer, which estimates the probability of the next word.
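A recurrent neural network (RNN) can also be used in language modeling. It keeps a hidden state vector ($\mathbf{h}_t$ at time $t$), dependent on its previous state ($\mathbf{h}_{t-1}$) and the current input vector $\mathbf{x}_t$, the word embedding of the current word. An output layer estimates the probability that each word occurs at this time slot. The formulas for a vanilla RNN are listed below, where the $W$'s refer to weights, $f$ is a nonlinear activation, and biases are omitted.
$$\mathbf{h}_t = f(W_{\text{in}} \mathbf{x}_t + W_{\text{hid}} \mathbf{h}_{t-1})$$
$$p(w_t \mid w_1^{t-1}) \approx \operatorname{softmax}(W_{\text{out}} \mathbf{h}_t)$$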
As indicated by the equations, an RNN provides a means of direct parametrization of Equation 1, and hence has the ability to capture long term dependency, compared with $n$-gram models. In practice, the vanilla RNN is difficult to train due to the vanishing or exploding gradient problem; long short term memory (LSTM) units [7] and gated recurrent units (GRU) [5] are proposed to better balance between the previous state and the current input.
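To make the vanilla RNN recurrence concrete, here is a minimal sketch of a single time step in plain Python. The choice of tanh as the activation, the toy dimensions, and the weight values are assumptions for illustration; biases are omitted, as in the text.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(x_t, h_prev, W_in, W_hid, W_out):
    """One vanilla RNN step: new hidden state and next-word distribution."""
    pre = [a + b for a, b in zip(matvec(W_in, x_t), matvec(W_hid, h_prev))]
    h_t = [math.tanh(p) for p in pre]
    probs = softmax(matvec(W_out, h_t))
    return h_t, probs


# Toy sizes (hypothetical): embedding dim 2, hidden dim 3, vocabulary of 4 words.
W_in = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
W_hid = [[0.2, 0.0, -0.1], [0.1, 0.1, 0.0], [0.0, -0.3, 0.2]]
W_out = [[0.3, -0.1, 0.2], [0.0, 0.4, -0.2], [0.1, 0.1, 0.1], [-0.2, 0.0, 0.3]]

h = [0.0, 0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0]):  # two input embeddings in sequence
    h, probs = rnn_step(x, h, W_in, W_hid, W_out)
print(probs)
```

The hidden state `h` is the only thing carried from one step to the next, which is how the model conditions on the entire prefix rather than on a fixed window of previous words.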
2.2 Language Generation
Using RNNs to model the joint probability of language makes it feasible to generate new sentences. An early attempt generates texts by a character-level RNN language model [14]; recently, RNN-based language generation has made breakthroughs in several real applications.
The sequence to sequence machine translation model [15] uses an RNN to encode a source sentence (in a foreign language) into one or a few fixed-size vectors; another RNN then decodes the vector(s) to the target sentence. Such a network can be viewed as a language model, conditioned on the source sentence. At each time step, the RNN predicts the most likely word as the output; the embedding of the word is fed to the input layer at the next step. The process continues until the RNN generates a special symbol eos indicating the end of the sequence. Beam search or sampling methods can be used to improve the quality and diversity of generated texts.
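The decoding loop described above can be sketched as follows. This is only an illustration of greedy generation: the lookup table stands in for a trained RNN decoder, and `TOY_TABLE`, `toy_model`, and `greedy_decode` are hypothetical names, not part of any cited system.

```python
def greedy_decode(next_word_probs, start, max_len=20, eos="<eos>"):
    """Repeatedly pick the most likely next word until eos is produced."""
    words = [start]
    while len(words) < max_len:
        probs = next_word_probs(words)       # distribution over next words
        best = max(probs, key=probs.get)     # greedy choice
        if best == eos:
            break
        words.append(best)                   # fed back as the next input
    return words


# Hypothetical stand-in for a trained model: it only looks at the last word.
TOY_TABLE = {
    "neural": {"machine": 0.6, "networks": 0.3, "<eos>": 0.1},
    "machine": {"translation": 0.8, "<eos>": 0.2},
    "translation": {"<eos>": 0.9, "neural": 0.1},
}


def toy_model(prefix):
    return TOY_TABLE[prefix[-1]]


title = greedy_decode(toy_model, "neural")
print(" ".join(title))
```

Beam search would keep several partial hypotheses at each step instead of the single best one, and sampling would draw from `probs` rather than taking the maximum.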
If the source sentence is too long to fit into one or a few fixed-size vectors, an attention mechanism [1] can be used to dynamically focus on different parts of the source sentence during target generation. In other studies, Wen et al. use an RNN to generate a sentence based on some abstract representations of semantics; they feed a one-hot vector, as additional information, to the RNN’s hidden layer [16]. In a question answering system, Yin et al. leverage a soft logistic switcher to either generate a word from the vocabulary or copy the candidate answer [19].
3 The Proposed B/F Language Model
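In this part, we introduce our B/F language model in detail. Our intuition is to seek a new approach to decompose the joint probability of a sentence (Equation 1). If we know a priori that a word $w_s$ should appear in the sentence ($\mathbf{w} = w_1^m$, $1 \le s \le m$), it is natural to design a Bayesian network where $w_s$ is the root node, and other words are conditioned on $w_s$. Following the spirit of “sequence” generation, we split the sentence into two subsequences:
$$\text{backward sequence: } w_s, w_{s-1}, \cdots, w_1 \quad (s \text{ words})$$
$$\text{forward sequence: } w_s, w_{s+1}, \cdots, w_m \quad (m - s + 1 \text{ words})$$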
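The probability of the sentence $\mathbf{w}$ with the split word at position $s$ decomposes as follows, where $p^{(\text{bw})}(\cdot)$ and $p^{(\text{fw})}(\cdot)$ denote the probability of a particular backward/forward sequence.
$$p\binom{w_s}{\mathbf{w}} = p(w_s) \prod_{i=1}^{s-1} p^{(\text{bw})}(w_{s-i} \mid \cdot) \prod_{i=1}^{m-s} p^{(\text{fw})}(w_{s+i} \mid \cdot) \tag{6}$$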
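The split itself is easy to express in code. Below is a minimal sketch, assuming a tokenized sentence and a 1-indexed split position `s`; the function name `split_sentence` is hypothetical.

```python
def split_sentence(words, s):
    """Split a sentence at the 1-indexed position s of the given word.

    Returns (backward, forward), both starting from the split word:
      backward = w_s, w_{s-1}, ..., w_1
      forward  = w_s, w_{s+1}, ..., w_m
    """
    if not 1 <= s <= len(words):
        raise ValueError("split position must lie inside the sentence")
    backward = words[s - 1::-1]   # walk from w_s back to w_1
    forward = words[s - 1:]       # walk from w_s on to w_m
    return backward, forward


words = ["learning", "deep", "features", "for", "tracking"]
backward, forward = split_sentence(words, 3)
print(backward)  # the split word followed by the preceding words, reversed
print(forward)   # the split word followed by the subsequent words
```

Both subsequences share the split word as their first element, so the backward one has $s$ words and the forward one has $m - s + 1$ words.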
To parametrize the equation, we propose two model variants. The first approach is to generate the backward and forward sequences simultaneously, and we call this the syn-B/F language model (Figure 1; previously called backbone LM). Concretely, Equation 6 takes the form
$$p\binom{w_s}{\mathbf{w}} = p(w_s) \prod_{t=1}^{\max\{s-1,\, m-s\}} p(w_{s \pm t} \mid w_{s-t+1}^{s+t-1})$$
where the factor $p(w_{s \pm t} \mid \cdot)$ refers to the conditional probability that the current time step $t$ generates $w_{s+t}$ and $w_{s-t}$ in the forward and backward sequences, respectively, given the middle part of the sentence, that is, $w_{s-t+1} \cdots w_{s+t-1}$. If one part has generated eos, we pad the special symbol eos for this sequence until the other part also terminates.
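To illustrate how the two chains are aligned step by step, here is a minimal sketch that builds the (backward, forward) target pair for each time step of a syn-B/F style model, padding the shorter chain with eos. The function name and the `<eos>` token string are assumptions; whether the terminating eos is itself a prediction target is also an assumption of this sketch.

```python
def syn_bf_targets(words, s, eos="<eos>"):
    """Per-step (backward_word, forward_word) targets around split position s.

    Each chain ends with eos; the shorter chain is padded with eos
    until the longer one has also terminated.
    """
    backward = words[:s - 1][::-1] + [eos]   # w_{s-1}, ..., w_1, eos
    forward = words[s:] + [eos]              # w_{s+1}, ..., w_m, eos
    steps = max(len(backward), len(forward))
    backward += [eos] * (steps - len(backward))
    forward += [eos] * (steps - len(forward))
    return list(zip(backward, forward))


words = "deep convolutional neural networks for image segmentation".split()
pairs = syn_bf_targets(words, 2)  # split word: "convolutional"
for bw, fw in pairs:
    print(bw, "|", fw)
```

With the split word near the start of the sentence, the backward chain finishes after one word and is then padded while the forward chain keeps going.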
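Here, $\mathbf{h}_t$ is the hidden layer, which is dependent on the previous state $\mathbf{h}_{t-1}$ and current input word embeddings $\mathbf{x}_t$. We use GRU [5] in our model, given by
$$\mathbf{r} = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1})$$
$$\mathbf{z} = \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1})$$
$$\tilde{\mathbf{h}} = \tanh(W_x \mathbf{x}_t + U_x (\mathbf{r} \circ \mathbf{h}_{t-1}))$$
$$\mathbf{h}_t = (1 - \mathbf{z}) \circ \mathbf{h}_{t-1} + \mathbf{z} \circ \tilde{\mathbf{h}}$$
where $\sigma(a) = 1 / (1 + e^{-a})$; $\circ$ denotes element-wise product. $\mathbf{r}$ and $\mathbf{z}$ are known as gates, and $\tilde{\mathbf{h}}$ is the candidate hidden state at the current step.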
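A GRU step can be sketched in a few lines of plain Python. This sketch assumes the convention $\mathbf{h}_t = (1 - \mathbf{z}) \circ \mathbf{h}_{t-1} + \mathbf{z} \circ \tilde{\mathbf{h}}$ (some formulations swap the roles of $\mathbf{z}$ and $1 - \mathbf{z}$), omits biases, and uses hypothetical parameter names.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x_t, h_prev, p):
    """One GRU step; p maps names W_r, U_r, W_z, U_z, W_x, U_x to matrices."""
    # Reset and update gates.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev))]
    # Candidate state uses the reset-gated previous state.
    gated = [r_i * h_i for r_i, h_i in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(p["W_x"], x_t), matvec(p["U_x"], gated))]
    # Interpolate between the previous state and the candidate.
    return [(1.0 - z_i) * h_i + z_i * c_i for z_i, h_i, c_i in zip(z, h_prev, h_tilde)]


# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so the new state is half of the previous one.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("W_r", "U_r", "W_z", "U_z", "W_x", "U_x")}
h_new = gru_step([1.0, 1.0], [1.0, -2.0], params)
print(h_new)
```

In a syn-B/F style model the input `x_t` would be built from the embeddings of both the backward and the forward word at the current step; how exactly they are combined is not specified here.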
In the syn-B/F model, we design a single RNN to generate both chains in hope that each is aware of the other’s state. Besides, we also propose an asynchronous version, denoted as asyn-B/F (Figure 2). The idea is to first generate the backward sequence, and then feed the obtained result to another forward RNN to generate future words. The detailed formulas are not repeated.
It is important to notice that asyn-B/F’s RNN for backward sequence generation is different from a generic backward LM. The latter is presumed to model a sentence from the last word to the first one, whereas our backward RNN is, in fact, a “half” LM, starting from $w_s$.
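The asynchronous procedure can be sketched as two greedy generation passes. In this illustration the two step functions are toy lookup tables standing in for trained backward and forward RNNs; all names (`generate_chain`, `asyn_bf_generate`, `BACK`, `FWD`) are hypothetical.

```python
def generate_chain(step_fn, context, eos="<eos>", max_len=20):
    """Greedily extend a chain: step_fn sees the context plus words so far."""
    out = []
    while len(out) < max_len:
        word = step_fn(context + out)
        if word == eos:
            break
        out.append(word)
    return out


def asyn_bf_generate(split_word, backward_step, forward_step):
    """Generate the backward half first, then the forward half given it."""
    backward = generate_chain(backward_step, [split_word])  # w_{s-1}, ..., w_1
    prefix = backward[::-1] + [split_word]                  # w_1, ..., w_s
    forward = generate_chain(forward_step, prefix)          # w_{s+1}, ..., w_m
    return prefix + forward


# Toy stand-ins for trained models: each looks only at the latest word.
BACK = {"tracking": "object", "object": "<eos>"}
FWD = {"tracking": "for", "for": "visual", "visual": "recognition",
       "recognition": "<eos>"}

title = asyn_bf_generate("tracking",
                         lambda ctx: BACK[ctx[-1]],
                         lambda ctx: FWD[ctx[-1]])
print(" ".join(title))
```

The forward pass receives the whole generated prefix, so a real forward RNN could condition on the backward half as well as on the split word.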
Training Criteria. If we assume $w_s$ is always given, the training criterion shall be the cross-entropy loss of all words in both chains except $w_s$. We can alternatively penalize the split word $w_s$ in addition, which will make it possible to generate an entire sentence without $w_s$ being provided. We do not deem the two criteria to differ significantly, and adopt the latter one in our experiments.
Both labeled and unlabeled datasets suffice to train the B/F language model. If a sentence is annotated with a word of special interest $w_s$, it is natural to use it as the split word. For an unlabeled dataset, we can randomly choose a word as $w_s$.
Notice that Equation 6 gives the joint probability of a sentence with a particular split word $w_s$. To compute the probability of the sentence, we shall marginalize out different split words, i.e.,
$$p(\mathbf{w}) = \sum_{s=1}^{m} p\binom{w_s}{\mathbf{w}}$$
where $p\binom{w_s}{\mathbf{w}}$ is the joint probability given by Equation 6.
In our scenarios, however, we always assume that $w_s$ is given in practice. Hence, different from language modeling in general, the joint probability of a sentence is not the number one concern in our model.
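Should the marginal probability of a sentence be needed, the sum over split positions is best computed in log space. The sketch below assumes a hypothetical function `joint_log_prob(words, s)` that returns the log of the joint probability for split position `s`; the constant toy scorer is a placeholder, not a trained model.

```python
import math


def log_sum_exp(values):
    """Stable log(sum(exp(v))) for a list of log-probabilities."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sentence_log_prob(words, joint_log_prob):
    """Marginalize the joint log-probability over all split positions."""
    return log_sum_exp([joint_log_prob(words, s)
                        for s in range(1, len(words) + 1)])


def toy_joint(words, s):
    # Placeholder scorer: every split position gets joint probability 0.01.
    return math.log(0.01)


words = ["an", "approach", "to", "tracking"]
log_p = sentence_log_prob(words, toy_joint)
print(math.exp(log_p))
```

Subtracting the maximum before exponentiating avoids underflow, which matters because joint probabilities of whole sentences are typically very small.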
4.1 The Dataset and Settings
To evaluate our B/F LMs, we prefer a vertical domain corpus with interesting application nuggets instead of using generic texts like Wikipedia. In particular, we chose to build a language model upon scientific paper titles on arXiv (http://arxiv.org). Building a language model on paper titles may help researchers when they are preparing their drafts. Provided a topic (designated by a given word), constrained natural language generation could also act as a way of brainstorming. (The title of this paper is NOT generated by our model.)
We crawled computer science-related paper titles from January 2014 to November 2015 (crawled from http://dblp.uni-trier.de/db/journals/corr/). Each word was decapitalized, but no stemming was performed. Rare words (below a fixed occurrence threshold) were grouped as a single token, unk (referring to unknown). We removed non-English titles, and those with more than three unk’s. We notice that unk’s may appear frequently, but a large number of them refer to acronyms, and thus are mostly consistent in semantics.
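The preprocessing steps can be sketched as follows. The occurrence threshold `min_count` is an assumed parameter (its actual value is not given above), whitespace tokenization is an assumption, and the non-English filter is left out of this sketch.

```python
from collections import Counter


def build_vocab(titles, min_count):
    """Keep lowercased words that occur at least min_count times."""
    counts = Counter(w for t in titles for w in t.lower().split())
    return {w for w, c in counts.items() if c >= min_count}


def preprocess(titles, min_count, max_unk=3, unk="unk"):
    """Lowercase, map rare words to unk, drop titles with too many unk's."""
    vocab = build_vocab(titles, min_count)
    kept = []
    for title in titles:
        tokens = [w if w in vocab else unk for w in title.lower().split()]
        if tokens.count(unk) <= max_unk:   # remove titles with more than three
            kept.append(tokens)
    return kept


# Hypothetical raw titles.
raw = ["Deep Learning for Vision",
       "Deep Learning Survey",
       "Zyx Qwv Abc Def Deep"]
cleaned = preprocess(raw, min_count=2)
for tokens in cleaned:
    print(" ".join(tokens))
```

In this toy run the third title is discarded because four of its five words fall below the threshold.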
Currently, we have 25,000 samples for training, 1455 for validation and another 1455 for testing; our vocabulary size is 3380. The asyn-B/F has one hidden layer with 100 units; syn-B/F has 200. This makes a fair comparison because syn-B/F should simultaneously learn implicit forward and backward LMs, which are completely different. In our models, embeddings are 50 dimensional, initialized randomly. To train the model, we used standard backpropagation (batch size 50) with element-wise gradient clipping. Following [9], we applied rmsprop for optimization (embeddings excluded), which is more suitable for training RNNs than naïve stochastic gradient descent, and less sensitive to hyperparameters compared with momentum methods. Initial weights were sampled from a uniform distribution. The initial learning rate was 0.002, with a multiplicative learning rate decay of 0.97, a moving average decay of 0.99, and a damping term. As word embeddings are sparse in use [12], they were optimized asynchronously by pure stochastic gradient descent with a separately scaled learning rate. The implementation was based on [10, 11].
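For readers unfamiliar with rmsprop, the update with element-wise gradient clipping can be sketched as below. The learning rate 0.002, learning rate decay 0.97, and moving average decay 0.99 come from the text; the damping value `eps`, the clipping bound `clip_at`, and applying the learning rate decay once per epoch are assumptions of this sketch.

```python
import math


def clip(grads, limit):
    """Element-wise gradient clipping to the interval [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


def rmsprop_step(params, grads, cache, lr, decay=0.99, eps=1e-8, clip_at=5.0):
    """In-place rmsprop update; cache holds the running mean of squared grads."""
    grads = clip(grads, clip_at)
    for i, g in enumerate(grads):
        cache[i] = decay * cache[i] + (1.0 - decay) * g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)


def decayed_lr(initial_lr, epoch, rate=0.97):
    """Multiplicative learning rate decay (assumed to apply per epoch)."""
    return initial_lr * rate ** epoch


# One step on f(x) = x^2 starting from x = 1, where the gradient is 2x.
params, cache = [1.0], [0.0]
rmsprop_step(params, [2.0 * params[0]], cache, lr=decayed_lr(0.002, 0))
print(params[0])
```

Dividing by the running root-mean-square of the gradient gives each parameter its own effective step size, which is one reason the method is less sensitive to the global learning rate.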
We first use the perplexity measure to evaluate the learned language models. Perplexity is defined as $2^{-\ell}$, where $\ell$ is the log-likelihood (with base 2), averaged over each word.
Note that eos is not considered when we compute the perplexity.
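As a small worked example of the measure, the sketch below computes perplexity from the probabilities a model assigns to each word of a held-out text. The function name is hypothetical, and the caller is assumed to have already left out the probability of eos.

```python
import math


def perplexity(word_probs):
    """Perplexity 2^(-l), with l the average base-2 log-likelihood per word."""
    ell = sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2.0 ** (-ell)


# A model that assigns probability 1/4 to every word is as uncertain as a
# uniform choice among four words.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(ppl)
```

Lower perplexity means the model spreads less probability mass over wrong continuations, so it can be read as an effective branching factor.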
We compare our models with several baselines:
Sequential LM: A pure LM, which is not applicable to constrained sentence generation.
Info-all: Built upon sequential LM, Info-all takes the wanted word’s embedding as additional input at each time step during sequence generation, similar to [16].
Info-init: The wanted word’s embedding is only added at the first step (sequence to sequence model [15]).
Sep-B/F: We train two separate forward and backward LMs (both starting from the split word).
Table 1: Perplexity (PPL) of the B/F language models.
| Method | Overall PPL | First word’s PPL | Subsequent words’ PPL |
| sep-B/F ($w_s$ oracle) | 99.2 | – | – |
| syn-B/F ($w_s$ oracle) | 97.5 | – | – |
| asyn-B/F ($w_s$ oracle) | 89.8 | – | – |
Table 2: Paper titles generated by the B/F models and baselines for a given word.
| deep convolutional neural networks for unk - based image segmentation | convolutional neural networks for unk - unk | deep convolutional neural networks | convolutional neural networks for unk -based object detection | convolutional neural networks for image classification |
| object tracking and unk for visual recognition | learning deep convolutional features for object tracking | object tracking | tracking - unk - based social media | unk - based unk detection for image segmentation |
| optimal control for unk systems with unk - type ii : a unk - unk approach | formal verification of unk - unk systems | optimal control for unk systems | systems - based synthesis for unk based diagnose | a new approach for the unk of the unk - free problem |
| unk : a new method for unk based on - line counting on unk | unk : a new approach for unk - based unk | unk : a survey | : a unk - based approach to unk - based deign of unk -based image retrieval | unk : a unk - based approach for unk - free grammar |
| an approach to unk the edge - preserving problem | an approach to unk | an approach to unk - unk | to unk : a unk - efficient and scalable framework for the unk of unk | a unk - based approach to unk for unk |
From the experimental results, we have the following observations.
All B/F variants yield a larger perplexity than a sequential LM. This makes sense because randomly choosing a split word increases uncertainty. It should also be noticed that, in our model, the perplexity reflects the probability of a sentence with a specific split word, whereas the perplexity of the sequential LM assesses the probability of a sentence itself.
Randomly choosing a split word cannot make use of position information in sentences. The titles of scientific papers, for example, oftentimes follow templates, which may begin with “unk : an approach” or “unk - based approach.” Therefore, sequential LM yields low perplexity when generating the word at a particular position, but such information is smoothed out in our B/F LMs because the split word is chosen randomly.
For words at later positions (the subsequent words in Table 1), B/F models yield almost the same perplexity as sequential LM. The long term behavior is similar to sequential LM, if we rule out the impact of choosing random words. For syn-B/F, in particular, the result indicates that feeding two words’ embeddings to the hidden layer does not add to confusion.
In our applications, $w_s$ is always given, which indicates $p(w_s) = 1$ (denoted as $w_s$ oracle in Table 1). This reduces the perplexity to less than 100, showing that our B/F LMs can make good use of the information that some word should appear in the generated text. Further, our syn-B/F is better than naïve sep-B/F; asyn-B/F is further capable of integrating information in backward and forward sequences.
We then generate new paper titles from the learned language model with a specific word being given, which can be thought of, in the application, as a particular interest of research topics. Table 2 illustrates examples generated from B/F models and baselines. As we see, for words that are common at the beginning of a paper title (like the adjective convolutional and the gerund tracking), sequential LM can generate reasonable results. For plural nouns like systems and models, the titles generated by sequential LM are somewhat disfluent, but they basically comply with grammar rules. For words that are unlikely to be the initial word, sequential LM fails to generate grammatically correct sentences.
Adding additional information does guide the network to generate sentences relevant to the topic, but the wanted word may not appear. This problem is also discussed in prior work.
By contrast, B/F LMs have the ability to generate correct sentences. But the sep-B/F model is too greedy in each of its chains. As generating short and general texts is a known issue with neural network-based LMs, sep-B/F can hardly generate a sentence containing much substance. syn-B/F is better, and asyn-B/F is able to generate sentences whose quality is comparable with sequential LMs.
In this paper, we proposed a backward and forward language model (B/F LM) for constrained natural language generation. Given a particular word, our model can generate previous words and future words either synchronously or asynchronously. Experiments show a similar perplexity to sequential LM, if we disregard the perplexity introduced by random splitting. Our case study demonstrates that the asynchronous B/F LM can generate sentences that contain the given word and are comparable to sequential LM in quality.
[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
[3] M. Berglund, T. Raiko, M. Honkala, L. Kärkkäinen, A. Vetek, and J. T. Karhunen. Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems, pages 856–864, 2015.
[4] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, pages 310–318, 1996.
[5] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
[6] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167, 2008.
[7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[8] D. Jurafsky and J. H. Martin. Speech and Language Processing. Pearson, 2014.
[9] A. Karpathy, J. Johnson, and F.-F. Li. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.
[10] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[11] L. Mou, H. Peng, G. Li, Y. Xu, L. Zhang, and Z. Jin. Discriminative neural sentence modeling by tree-based convolution. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2315–2325, 2015.
[12] H. Peng, L. Mou, G. Li, Y. Chen, Y. Lu, and Z. Jin. A comparative study on regularization strategies for embedding-based neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2106–2111, 2015.
[13] A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, 2015.
[14] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning, pages 1017–1024, 2011.
[15] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[16] T.-H. Wen, M. Gasic, D. Kim, N. Mrksic, P.-H. Su, D. Vandyke, and S. Young. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 275–284, 2015.
[17] Y. Xu, L. Mou, G. Li, Y. Chen, H. Peng, and Z. Jin. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of Conference on Empirical Methods in Natural Language Processing, 2015.
[18] K. Yao, G. Zweig, and B. Peng. Attention with intention for a neural network conversation model. arXiv preprint arXiv:1510.08565 (NIPS Workshop), 2015.
[19] J. Yin, X. Jiang, Z. Lu, L. Shang, H. Li, and X. Li. Neural generative question answering. arXiv preprint arXiv:1512.01337, 2015.