Adversarial Training for Community Question Answer Selection
Based on Multi-scale Matching
Community-based question answering (CQA) websites represent an important source of information. As a result, the problem of matching the most valuable answers to their corresponding questions has become an increasingly popular research topic. We frame this task as a binary (relevant/irrelevant) classification problem, and propose a Multi-scale Matching model that inspects the correlation between words and ngrams (word-to-ngrams) at different levels of granularity. This is in addition to the word-to-word correlations used in most prior work. In this way, our model is able to capture the rich context information conveyed in ngrams, and can therefore better differentiate good answers from bad ones. Furthermore, we present an adversarial training framework that iteratively generates challenging negative samples to fool the proposed classification model. This differs from previous methods, where negative samples are uniformly sampled from the dataset during training. The proposed method is evaluated on the SemEval 2017 and Yahoo Answers datasets and achieves state-of-the-art performance.
Xiao Yang, Miaosen Wang, Wei Wang, Madian Khabsa, Ahmed Awadallah
Pennsylvania State University, Google, Microsoft, Apple
Community-based question answering (CQA) websites such as Yahoo Answers and Quora are important sources of information and knowledge. They allow users to submit questions and answers covering a wide range of topics. These websites often organize such user-generated content in the form of a question followed by a list of candidate answers. Over time, a large number of crowd-sourced question/answer pairs has accumulated, which can be leveraged to automatically answer a newly submitted question.
To make full use of the information and knowledge stored in CQA systems, the community question answer selection task has received much attention recently. The CQA selection task aims at automatically retrieving archived answers that are relevant to a newly submitted question. Since many users tend to submit new questions rather than search existing ones [?], a large number of questions reoccur and may already be answered by previous content. Several challenges exist for this task, among which the lexical gap is a fundamental one that differentiates this task from other general-purpose information retrieval (IR) tasks. The term lexical gap describes the phenomenon that words or ngrams in questions may lead to causally related content in answers, rather than semantically related content such as synonyms. In other words, a sentence sharing many overlapping words with the question may not necessarily be a relevant answer.
To tackle the lexical gap problem, many methods explicitly model the correlations between text fragments in questions and answers, and frame this task as a binary classification problem. Under this setting, each question/answer pair is labeled as either relevant or non-relevant. Consequently, the CQA selection task can be approached by first predicting the relevance score of each candidate answer to a question, then re-ranking these answers to find the most appropriate one. The “matching-aggregating” framework [?; ?; ?] is representative of this line of research. It first represents each word by an embedding, then exhaustively compares each word in the question to each word in the answer. The comparison results are later aggregated by a feed-forward neural network to make final predictions. Various strategies have been proposed for aggregating comparisons, such as the max-pooling method [?] and the attention method [?]. Our work also follows this framework. However, unlike most prior work, which only considers word-to-word comparisons, we also examine comparisons between words and ngrams of different lengths. The rationale is that the semantic meaning of a text fragment is not the simple combination of the meanings of its individual words [?]. By considering word-to-ngrams comparisons, our model is able to capture semantic information at different levels of granularity and utilize it to assist classification. To obtain word-to-ngrams comparisons, we employ a deep convolutional neural network (CNN) to learn a hierarchical representation for each sentence. Neurons at higher levels compress information from a larger context. Representations at different levels of one sentence are then compared with those from the other sentence. In this way, comparisons at multiple levels of granularity are inspected.
When treating CQA selection as a binary classification task, a practical issue is how to construct negative samples. While a training set often provides initial positive samples (questions and their answers labeled as relevant) and negative samples (questions and their answers labeled as non-relevant), many researchers augment negative samples by randomly coupling a question with an answer from a different question thread. The underlying assumption is that answers from other questions are unlikely to qualify as relevant to the current question. While such data augmentation can provide many more training samples, it also amplifies the problem of imbalanced class labels. Inspired by Generative Adversarial Nets (GANs) [?], we present an adversarial training strategy which employs a generative model to produce a relatively small number of high-quality negative samples. While most negative samples augmented by uniform sampling can be easily classified (e.g. the topics of the answers are completely irrelevant to the questions), samples produced by a generative model are expected to be more challenging, and are therefore more likely to fool the classification model. By alternately optimizing the generative model and the classification model, we can finally obtain a more robust and accurate classifier.
Our contributions are summarized as follows:
We extend the current “matching-aggregating” framework for the CQA selection task by considering matchings at multiple levels of granularity. Prior work only examines word-to-word comparisons, and therefore cannot capture semantic information conveyed in a larger context.
We present an adversarial training strategy which employs a generative model to produce challenging negative samples. By alternately optimizing the generative model and the classification model, we are able to improve prediction accuracy.
The proposed model is evaluated on SemEval 2017 and Yahoo Answer datasets and achieves state-of-the-art performance.
2 Related Work
2.1 Community Question Answering
For an automatic community question answer selection system, two main tasks exist: (1) retrieving questions related to a newly submitted question [?; ?; ?; ?]; and (2) retrieving potentially relevant answers to a newly submitted question [?; ?; ?; ?; ?; ?; ?]. Successfully accomplishing the first task can assist the second task. However, it is not a required step. Techniques for CQA selection can be broadly categorized into three classes: (1) statistical translation models; (2) latent variable models; and (3) deep learning models.
Early work devoted great effort to statistical translation models, which take parallel corpora as input and learn correlations between words and phrases from one corpus and another. For example, [?; ?] use IBM translation model 1 to learn the translation probability between question and answer words. Later work has improved upon them by considering phrase-level correlations [?] and entity-level correlations [?]. The proposed Multi-scale Matching model shares a similar idea by incorporating word-to-ngrams comparisons; however, such comparisons are modeled by a deep neural network rather than a translation probability matrix.
Another line of work explores topic models for this task. Such approaches [?; ?; ?] usually learn the latent topics of questions and answers, under the assumption that a relevant answer should have a topic distribution similar to that of the question. Recently, these approaches have been combined with word embeddings [?; ?; ?] and translation models [?], which has led to further performance improvements.
With the recent success of deep learning models in multiple natural language processing (NLP) tasks, researchers started to explore deep models for CQA. [?] proposed a neural network to predict the pairwise ranking of two candidate answers. [?] trained two auto-encoders, for questions and answers respectively, which share an intermediate semantic representation. Recently, a number of works have framed this task as a text classification problem and proposed several deep neural network based models. For example, [?] first encode sentences into sentence embeddings, then predict the relationship between questions and answers based on the learned embeddings. However, such approaches ignore the low-level interactions between words in sentences, and their performance is therefore usually limited. Later, [?; ?; ?] proposed a matching-aggregating framework which first exhaustively compares words from one sentence to another, then aggregates the comparison results to make final predictions. Different matching strategies have been proposed, such as attentive matching [?], max-pooling matching [?], or a combination of various matching strategies [?]. The proposed Multi-scale Matching model also follows this framework; however, we explore comparisons at multiple levels of granularity (word-to-word and word-to-ngrams).
2.2 Generative Adversarial Nets and NLP
Generative Adversarial Nets (GANs) [?] were first proposed for generating samples from a continuous space such as images. A GAN consists of a generative model $G$ and a discriminative model $D$. $G$ aims to fit the real data distribution and attempts to map random noise (e.g. a random sample from a Gaussian distribution) to a real sample (e.g. an image). In contrast, $D$ attempts to differentiate real samples from fake ones generated by $G$. During training, $G$ and $D$ are alternately optimized, forming a minimax game. A number of extensions to GANs have been proposed to achieve stable training and better visualization results for image generation.
The idea of adversarial training can also be applied to NLP tasks. Although such tasks often involve a discrete sampling process which is not differentiable, researchers have proposed several solutions such as policy gradient [?; ?; ?] and the Gumbel Softmax trick [?]. [?] proposed SeqGAN to generate sequences of words from noise. [?] adopted adversarial training to improve the robustness of a dialog generation system. A work more closely related to our method is IRGAN [?], which applied adversarial training to multiple information retrieval tasks. However, IRGAN models the relationship between two documents solely based on the learned sentence embeddings, ignoring all low-level interactions. In contrast, we explore comparisons at multiple levels of granularity, and use the aggregated comparison results to measure relevance.
In this section, we first formally define the task of community question answer selection by framing it as a binary classification problem, then describe our Multi-scale Matching Model for classification. Finally, we present details about how to fit this model in an adversarial training framework.
Let $Q = [q_1, \dots, q_m]$ and $A = [a_1, \dots, a_n]$ be the input question and answer sentence of length $m$ and $n$, respectively. Let $f_\theta(Q, A)$ be a score function parameterized by $\theta$ that estimates the relevance between $Q$ and $A$. A higher value of $f_\theta(Q, A)$ means that an answer is more relevant to the question. Given a question $Q$, its corresponding candidate answer set can be ranked based on the predicted relevance score. The top ranked answers will be selected as the correct answers. Therefore the answer selection task can be accomplished by solving a binary classification problem.
3.1 Multi-scale Matching Model
The goal of the proposed Multi-scale Matching Model is to estimate relevance score of a question/answer pair. Our model follows the “matching-aggregating” framework. Different from previous methods which only consider word-to-word matching, we also investigate the relation between word and ngrams of different lengths. In this way, the proposed model can leverage context information conveyed in ngrams, therefore better differentiates good answers from bad ones. The architecture of the proposed model is illustrated in Figure 2.
Word and Ngram Embeddings For either a question or answer sentence, we represent each word with a $d$-dimensional real-valued vector. In this work, we use pre-trained word embeddings from GloVe [?], which have proven effective in multiple natural language processing tasks. For each sentence, our model learns a hierarchy of representations using a temporal convolutional neural network. Neurons at lower levels learn local semantic information, while neurons at higher levels compress context semantic information within their receptive field. More formally, for a sentence $Q = [q_1, \dots, q_m]$ where each word is represented by the corresponding word embedding, a series of convolution blocks are applied:

$$Q^{(u)} = \mathrm{ConvBlock}\big(Q^{(u-1)}\big), \qquad Q^{(0)} = Q$$

Here $Q^{(u)}$ is the resulting feature map after applying the $u$-th ($u = 1, \dots, U$) convolution block. A convolution block consists of a temporal convolution layer, followed by a batch normalization [?] layer, a rectified linear unit [?] layer and a max pooling layer. The kernel size of the temporal convolution layer is 3 and the number of output channels is 128. At the end, a hierarchy of feature representations $\{Q^{(0)}, Q^{(1)}, \dots, Q^{(U)}\}$ is learned, which compresses the semantic information at different levels of granularity. For example, $Q^{(0)}_i$ represents the information of the $i$-th word embedding, while $Q^{(2)}_i$ represents the context information from a 5-gram since the receptive field is 5.
A similar process is applied to the answer sentence $A$, resulting in another hierarchy of feature representations $\{A^{(0)}, A^{(1)}, \dots, A^{(U)}\}$.
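The receptive-field figure quoted above can be checked with a small calculation. The sketch below assumes stride-1 convolutions with kernel size 3 and assumes that the pooling layer does not widen the receptive field; the function name `receptive_field` is illustrative and not taken from the paper.

```python
def receptive_field(num_blocks, kernel_size=3):
    """Number of input words seen by one unit after `num_blocks` stacked convolutions.

    Assumption: stride 1 everywhere, and pooling adds no extra context.
    """
    return 1 + num_blocks * (kernel_size - 1)


# Level 0 is the word embedding itself; each block adds (kernel_size - 1) words.
for level in range(4):
    print(level, receptive_field(level))
```

Under these assumptions, level 0 covers a single word and level 2 covers five words, which matches the 5-gram example in the text.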
Multi-scale Matching and Aggregating For a specific pair of feature representations $Q^{(u)}$ and $A^{(v)}$, we can define a matching function $M(Q^{(u)}, A^{(v)})$ to measure the relation between them. In the following, we first describe how to realize the matching function $M$, then define the score function $f_\theta(Q, A)$ based on the matching results $M(Q^{(u)}, A^{(v)})$, abbreviated as $M_{u,v}$.
There are multiple ways to realize the matching function $M$, for example max-pooling matching [?] and attentive matching [?]. Here we adopt the max-pooling matching method due to its simplicity. First, we compare each timestep in $Q^{(u)}$ and $A^{(v)}$ using a function $c$:

$$C_{i,j} = c\big(Q^{(u)}_i, A^{(v)}_j\big)$$

where the function $c$ is implemented by a two-layer feed-forward neural network. Note that both $Q^{(u)}_i$ and $A^{(v)}_j$ are $d$-dimensional vectors, and the output $C_{i,j}$ is also a vector. For each timestep $i$ in $Q^{(u)}$, we aggregate the comparison results by max-pooling and obtain a single vector $h^{Q}_i$:

$$h^{Q}_i = \mathrm{Pooling}\big(\{C_{i,j}\}_{j}\big)$$

where Pooling denotes element-wise max pooling. In other words, only the maximum value of each dimension is retained. Figure 2 shows a diagram of this aggregation process.
Similarly, for each timestep $j$ in $A^{(v)}$, we can also aggregate the comparison results and obtain:

$$h^{A}_j = \mathrm{Pooling}\big(\{C_{i,j}\}_{i}\big)$$

Now that we have two sets of comparison vectors $\{h^{Q}_i\}$ and $\{h^{A}_j\}$, we can first aggregate over each set by summation:

$$h^{Q} = \sum_{i} h^{Q}_i, \qquad h^{A} = \sum_{j} h^{A}_j$$

then formulate the matching function as the concatenation of $h^{Q}$ and $h^{A}$:

$$M\big(Q^{(u)}, A^{(v)}\big) = \big[h^{Q}; h^{A}\big]$$

Based on the matching function $M$ defined above, the score function can be formulated as:

$$f_\theta(Q, A) = g\big([M_{0,0}; M_{0,1}; \dots; M_{U,U}]\big) \tag{9}$$

where $[M_{0,0}; M_{0,1}; \dots; M_{U,U}]$ denotes the concatenation of all possible matching results for $Q$ and $A$, and $g$ is a real-valued function. The function $g$ can be realized by a two-layer fully connected neural network. Equation 9 indicates that all possible word-to-word, word-to-ngram and ngram-to-ngram matchings are being considered. A simpler way is to formulate the score function as:

$$f_\theta(Q, A) = g\big([M_{0,0}; M_{0,1}; \dots; M_{0,U}; M_{1,0}; \dots; M_{U,0}]\big) \tag{10}$$

meaning that we only consider word-to-word and word-to-ngram matchings, and ignore all ngram-to-ngram matchings. It is also clear that the approach in [?; ?; ?] is equivalent to formulating the score function as:

$$f_\theta(Q, A) = g\big(M_{0,0}\big) \tag{11}$$

indicating that only word-to-word matching is examined.
In this work, we adopt the second way as described in Equation 10, since the first way (Equation 9) is computationally expensive and the third way (Equation 11) cannot utilize context information conveyed in ngrams. Results in Section 4 show that the second way leads to a significant improvement compared with the third way.
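The compare, max-pool, sum and concatenate steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the learned two-layer comparison network is replaced by a hypothetical element-wise product (`compare`), and the toy feature vectors are invented for the example.

```python
def compare(q_vec, a_vec):
    # Hypothetical stand-in for the learned two-layer comparison network:
    # an element-wise product, which also returns a vector.
    return [x * y for x, y in zip(q_vec, a_vec)]


def elementwise_max(vectors):
    # Keep only the maximum value of each dimension.
    return [max(dims) for dims in zip(*vectors)]


def elementwise_sum(vectors):
    return [sum(dims) for dims in zip(*vectors)]


def match(q_feats, a_feats):
    """Max-pooling matching between two sequences of feature vectors."""
    # One pooled vector per question timestep (max over answer timesteps).
    h_q = [elementwise_max([compare(q, a) for a in a_feats]) for q in q_feats]
    # One pooled vector per answer timestep (max over question timesteps).
    h_a = [elementwise_max([compare(q, a) for q in q_feats]) for a in a_feats]
    # Sum each set, then concatenate the two summaries.
    return elementwise_sum(h_q) + elementwise_sum(h_a)


# Toy 2-dimensional features (invented for illustration).
q_words = [[1.0, 0.0], [0.0, 2.0]]
a_words = [[1.0, 1.0], [0.5, 3.0], [2.0, 0.0]]
a_ngrams = [[0.0, 1.0], [1.0, 1.0]]  # e.g. a higher-level representation of the answer

# Word-to-word matching followed by one word-to-ngram matching.
features = match(q_words, a_words) + match(q_words, a_ngrams)
print(features)
```

The concatenated `features` list plays the role of the input to the final scoring network. Only matchings that involve the word level of the question are shown here; the remaining word-to-ngram pairs would be appended in the same way.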
3.2 Adversarial Training for Answer Selection
Generative Adversarial Nets Generative Adversarial Nets were first proposed by [?]. They consist of two “adversarial” models: a generative model (generator $G$) aiming at capturing the real data distribution $p_{data}$, and a discriminative model (discriminator $D$) that estimates the probability that a sample comes from the real training data rather than the generator. Both the generator and discriminator can be implemented by non-linear mapping functions, such as feed-forward neural networks.
The discriminator $D$ is optimized to maximize the probability of assigning the correct labels to either training samples or generated ones. On the other hand, the generator $G$ is optimized to maximize the probability of $D$ making a mistake, or equivalently to minimize $\log(1 - D(x))$ over generated samples $x$. Therefore, the overall objective can be summarized as:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{x \sim p_{G}(x)}\big[\log\big(1 - D(x)\big)\big]$$

where the generative model $G$ is written as $p_{G}(x)$. During training, we alternately minimize and maximize the same objective function to learn the generator $G$ and the discriminator $D$, respectively.
Adversarial Training for Answer Selection Inspired by the idea of GANs, we propose an adversarial training framework which uses a Multi-scale Matching model to produce high-quality adversarial samples and another Multi-scale Matching model to differentiate positive samples from negative ones. In parallel to terminologies used in GANs literature, we will call these two models generator and discriminator respectively.
In the context of answer selection, the generator aims to capture the real data distribution $p_{true}(A \mid Q)$ and generate (or select) relevant answers conditioned on the question sentence $Q$. In contrast, the discriminator attempts to distinguish between relevant and irrelevant answers depending on $Q$. In other words, the discriminator is simply a binary classifier. Formally, the GAN objective above can be rewritten as:

$$\min_{G} \max_{D} \; \sum_{n=1}^{N} \Big( \mathbb{E}_{A \sim p_{true}(A \mid Q_n)}\big[\log D(A \mid Q_n)\big] + \mathbb{E}_{A' \sim G(A' \mid Q_n)}\big[\log\big(1 - D(A' \mid Q_n)\big)\big] \Big)$$

where $Q_1, \dots, Q_N$ are the training questions.
We now describe how to use the proposed Multi-scale Matching model to build our discriminator and generator. Since the proposed model can be seen as a score function which measures how relevant an answer $A$ is to a question $Q$, we can directly feed the relevance score into a sigmoid function to build our discriminative model. The generator attempts to fit the underlying real data distribution and, based on that, randomly samples an answer from the whole answer set in order to fool the discriminator. To model this process of sampling according to a probability distribution, we employ another Multi-scale Matching model as a score function and evaluate it on every candidate answer. Afterwards, answers with high relevance scores are sampled with high probability. In other words, we would like to select negative answers from the whole set which are more relevant to $Q$ according to the current generator.
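Formally, given a set of candidate answers $\mathcal{A}$ of a specific question $Q$, the discriminative model and the generative model are modeled by:

$$D(A \mid Q) = \sigma\big(f_\phi(Q, A)\big), \qquad G(A \mid Q) = \frac{\exp\big(f_\theta(Q, A)/\tau\big)}{\sum_{A' \in \mathcal{A}} \exp\big(f_\theta(Q, A')/\tau\big)} \tag{14}$$

with $\sigma$ being a sigmoid function and $\tau$ being a temperature hyper-parameter. As the temperature $\tau$ approaches 0, samples from $G$ are more likely to come from those with high relevance scores. $\tau$ is set to 2 in our experiments.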
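To make the role of the temperature concrete, the following sketch turns a list of relevance scores into discriminator probabilities (sigmoid) and generator sampling probabilities (softmax with temperature). The scores are made up for illustration, and the function names are assumptions rather than names from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_prob(score):
    # Probability that an answer is relevant, given its relevance score.
    return sigmoid(score)


def generator_probs(scores, temperature=2.0):
    # Softmax over score / temperature; the max is subtracted for numerical stability.
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [3.0, 1.0, -2.0]  # hypothetical relevance scores of three candidate answers
print([round(p, 3) for p in generator_probs(scores, temperature=2.0)])
print([round(p, 3) for p in generator_probs(scores, temperature=0.1)])
```

Lowering the temperature concentrates the sampling probability on the highest-scoring answer, while a larger temperature spreads it more evenly across candidates.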
Ideally, the score function $f_\theta$ needs to be evaluated on each possible answer. However, the actual size of such an answer set can be very large, making the summation intractable. To address this issue, in practice we first uniformly sample an alternative answer set $\tilde{\mathcal{A}}$ whose size is much smaller (e.g. 100) than the set of all possible answers. This set is constituted by answers from two sources: (1) labeled negative answers for question $Q$; and (2) answers from other questions $Q' \neq Q$. Since irrelevant answers far outnumber relevant answers, the resulting set $\tilde{\mathcal{A}}$ is unlikely to contain any false negatives. Then we evaluate the score function on each answer in $\tilde{\mathcal{A}}$, and randomly sample adversarial answers according to the probability calculated in Equation 14.
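The two-stage procedure above (build a small candidate pool, then sample from it according to the generator's probabilities) might look as follows. This is a sketch under assumptions: the text does not specify how the two sources are mixed, so the pool here keeps the labeled negatives and fills the rest uniformly from other threads; `toy_score` is a hypothetical stand-in for the learned score function.

```python
import math
import random


def build_candidate_pool(labeled_negatives, other_thread_answers, pool_size, rng):
    # Assumption: keep labeled negatives first, then fill the pool with
    # answers drawn uniformly (without replacement) from other threads.
    pool = list(labeled_negatives)[:pool_size]
    remaining = pool_size - len(pool)
    if remaining > 0:
        fill = min(remaining, len(other_thread_answers))
        pool += rng.sample(other_thread_answers, fill)
    return pool


def sample_adversarial(pool, score_fn, num_samples, temperature, rng):
    # Sample answers with probability proportional to exp(score / temperature).
    scaled = [score_fn(answer) / temperature for answer in pool]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    # random.choices draws with replacement, so an answer may repeat.
    return rng.choices(pool, weights=weights, k=num_samples)


def toy_score(answer):
    # Hypothetical score: longer answers count as "more relevant".
    return float(len(answer.split()))


rng = random.Random(0)
negatives = ["you need a visa", "ask the embassy"]
others = ["other answer number %d" % i for i in range(50)]
pool = build_candidate_pool(negatives, others, pool_size=10, rng=rng)
adversarial = sample_adversarial(pool, toy_score, num_samples=3, temperature=2.0, rng=rng)
print(adversarial)
```

Restricting the softmax to the pool keeps the normalization cheap, since the score function is evaluated on only `pool_size` answers per question.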
The original GANs require that both the generator and the discriminator are fully differentiable, so that a gradient-based optimization algorithm can be applied. However, this is not true in our case due to the random sampling step involved in the generator. A number of approaches have been proposed to tackle this problem, such as policy gradient [?; ?; ?] and the Gumbel Softmax trick [?]. Here we adopt the policy gradient approach. As can be seen from the objective above, the objective function $J^{G}$ for optimizing $G$ is expressed as minimizing the expectation of a function evaluated on samples from a probability distribution. Therefore, using the REINFORCE [?] algorithm, the gradient of $J^{G}$ with respect to $G$’s parameters $\theta$ can be derived as:

$$\nabla_\theta J^{G}(Q) = \nabla_\theta \, \mathbb{E}_{A' \sim G(A' \mid Q)}\big[\log\big(1 - D(A' \mid Q)\big)\big] = \mathbb{E}_{A' \sim G(A' \mid Q)}\big[\nabla_\theta \log G(A' \mid Q)\, \log\big(1 - D(A' \mid Q)\big)\big] \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_\theta \log G(A_s \mid Q)\, \log\big(1 - D(A_s \mid Q)\big)$$

where in the last step the expectation is approximated by sampling $S$ answers $A_1, \dots, A_S$ from $G$. From a reinforcement learning point of view, the term $\log(1 - D(A_s \mid Q))$ is the received reward when a policy $G$ takes an action of choosing answer $A_s$.
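The sampling approximation can be illustrated on a small softmax policy. In the sketch below the "parameters" are simply the relevance scores themselves, so the gradient is taken with respect to each score; the discriminator outputs are invented numbers. A Monte Carlo REINFORCE estimate is compared against the exact gradient of the expected reward. This is an illustration of the estimator, not the authors' training code.

```python
import math
import random


def softmax(scores, temperature):
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(scores, rewards, temperature, num_samples, rng):
    """Monte Carlo estimate of d E[reward] / d score_j for a softmax policy."""
    probs = softmax(scores, temperature)
    grad = [0.0] * len(scores)
    for _ in range(num_samples):
        k = rng.choices(range(len(scores)), weights=probs, k=1)[0]
        for j in range(len(scores)):
            # d log p_k / d score_j = (1[j == k] - p_j) / temperature
            dlogp = ((1.0 if j == k else 0.0) - probs[j]) / temperature
            grad[j] += rewards[k] * dlogp / num_samples
    return grad


def exact_grad(scores, rewards, temperature):
    # Closed form of the same gradient: p_j * (r_j - E[r]) / temperature.
    probs = softmax(scores, temperature)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) / temperature for p, r in zip(probs, rewards)]


scores = [1.0, 0.0, -1.0]        # hypothetical generator scores
d_probs = [0.9, 0.5, 0.1]        # hypothetical discriminator outputs D(A_k | Q)
rewards = [math.log(1.0 - d) for d in d_probs]

estimate = reinforce_grad(scores, rewards, 2.0, 20000, random.Random(0))
exact = exact_grad(scores, rewards, 2.0)
print([round(g, 3) for g in estimate])
print([round(g, 3) for g in exact])
```

With enough samples the two gradients agree closely, which is the property the policy gradient approach relies on when the sampling step itself cannot be differentiated.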
In this section, we evaluate the proposed method on two benchmark datasets: SemEval 2017 and Yahoo Answers. Ablation experiments are conducted on both datasets to demonstrate the effectiveness of the proposed Multi-scale Matching model and the adversarial training strategy.
4.1 Datasets and Evaluation
SemEval 2017 dataset is used for SemEval 2017 Task 3 Subtask C (Question-External Comment Similarity). It contains 317 original questions from Qatar Living website. Each question is associated with the first 10 related questions (retrieved by a search engine) and their corresponding top 10 answers appearing in the thread. As a result, each question is associated with 100 answers, and the ultimate goal is to re-rank these 100 answers according to their relevance to the question. The train/develop/test set split is provided by the dataset.
Evaluation Metrics On SemEval 2017 dataset, we use the official evaluation measure for the competition which is mean average precision (MAP) calculated over the top 10 ranked answers. We also report mean reciprocal rank (MRR), which is another widely-used information retrieval measure. On Yahoo Answer dataset, precision at top 1 rank position (Prec@1) is reported to compare with other methods.
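For a single question, Average Precision (AP) is the average of the precisions obtained for the top ranked answers. This value is then averaged over different questions to yield MAP. Formally, MAP is defined as:

$$\mathrm{MAP} = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \frac{1}{N_q} \sum_{a \in A_q^{K}} \mathrm{Prec}(a)\, \mathrm{rel}(a)$$

where $\mathcal{Q}$ is the set of test questions, $N_q$ denotes the total number of relevant answers in the top $K$ results and $A_q^{K}$ represents the top $K$ ranked answers for question $q$. $\mathrm{Prec}(a)$ is the precision at the rank position of answer $a$, and $\mathrm{rel}(a)$ is 1 if $a$ is relevant and 0 otherwise.
Mean Reciprocal Rank (MRR) is defined as:

$$\mathrm{MRR} = \frac{1}{|\mathcal{Q}|} \sum_{i=1}^{|\mathcal{Q}|} \frac{1}{\mathrm{rank}_i}$$

with $\mathrm{rank}_i$ being the rank position of the first relevant answer for the $i$-th question.
Precision at top 1 rank position (Prec@1) evaluates the precision of choosing the best answer at the first rank position. This value is first calculated for each question, then averaged over all questions:

$$\mathrm{Prec@1} = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \delta_q$$

where $\delta_q \in \{0, 1\}$ denotes whether the first ranked result is relevant to question $q$ or not.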
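The three measures can be computed from ranked lists of binary relevance labels, as in the sketch below. It follows the description in the text and is not the official SemEval scorer: average precision is normalized by the number of relevant answers found within the top `k` positions, which is an assumption about the exact normalization. The two toy rankings are invented.

```python
def average_precision(relevance, k=10):
    """AP over the top-k positions of a ranked list of 0/1 relevance labels."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank position
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(relevance):
    # 1 / rank of the first relevant answer, or 0 if none is relevant.
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def precision_at_1(relevance):
    return 1.0 if relevance and relevance[0] else 0.0


def mean(values):
    return sum(values) / len(values)


# One ranked label list per question (1 = relevant, 0 = not relevant).
rankings = [[0, 1, 1, 0], [1, 0, 0, 0]]
map_score = mean([average_precision(r) for r in rankings])
mrr_score = mean([reciprocal_rank(r) for r in rankings])
p1_score = mean([precision_at_1(r) for r in rankings])
print(map_score, mrr_score, p1_score)
```

For the first toy ranking the relevant answers sit at positions 2 and 3, giving precisions 1/2 and 2/3; the second ranking has its only relevant answer at position 1.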
4.2 Results on SemEval 2017
Table 1 summarizes the results of different methods on SemEval 2017 dataset (numbers are extracted from [?]). For our methods, “single” denotes that we only consider word-to-word matchings as in Equation 11, while “multi” means that we consider both word-to-word and word-to-ngrams matchings as in Equation 10. “adversarial” means that we employ an additional generative model to produce challenging adversarial samples to fool the discriminative model during training. From the table we can see that using Multi-scale Matching consistently improves the performance. With only a discriminative model and no adversarial training, the mean average precision is increased from 14.67 to 14.92. With adversarial training, the mean average precision is increased from 15.48 to 15.61. Furthermore, with adversarial training, both our single-scale and multi-scale models outperform previous methods. This demonstrates the effectiveness of utilizing a generative model to produce challenging negative samples. Since in SemEval 2017 dataset each question is associated with a relatively large number (100) of candidate answers, it is likely that many of them can be easily classified. Therefore, it is beneficial to have challenging adversarial samples during training in order to get a robust discriminator/classifier.
Table 2 shows an example to demonstrate the effectiveness of our proposed method. For the question “Is it possible to sponsor husband? Is possible, then can he able to work in Qatar?”, the Ours (single) model ranks an irrelevant question sentence as the top-1 answer. While this model finds a sentence that shares several common words with the question, it fails to capture the question-answer structure. Each of the Ours (multi), Ours (single+adversarial) and Ours (multi+adversarial) models finds a declarative sentence as the top-1 answer, and the ground-truth labels are either potentially useful or relevant. All three models successfully capture the question-answer structure, for example by relating “is it …?” to “he can…”. No clear ranking emerges among these three models, but the models with adversarial training seem to prefer more detailed explanations such as “letter of no objection” and “approval”. Meanwhile, we can see that the generators in adversarial training find irrelevant sentences (associated with questions other than $Q$) that share many common words (e.g. sponsor and wife) with the question. These sentences are more challenging negative samples than sentences chosen at random from other questions.
In this work, we framed the community question answer selection task as a binary classification problem, and proposed a Multi-scale Matching model which is able to explore context information. Different from previous methods in matching-aggregating framework which only consider word-to-word matching, we examine matchings from multiple levels of granularity. Furthermore, inspired by the Generative Adversarial Nets (GANs), we presented an adversarial training strategy which uses a generative model to produce challenging negative samples during training. After alternately optimizing the generative model and the discriminative model, we are able to obtain a more robust classifier. The proposed method is evaluated on two benchmark datasets: SemEval 2017 and Yahoo Answers and achieved state-of-the-art performance. Future work would investigate the stability of GAN training which remains an open research question, especially when discrete sampling is involved.
- [Cai et al., 2011] Li Cai, Guangyou Zhou, Kang Liu, and Jun Zhao. Learning the latent topics for question retrieval in community qa. In IJCNLP, volume 11, pages 273–281, 2011.
- [Deepak et al., 2017] P Deepak, Dinesh Garg, and Shirish Shevade. Latent space embedding for retrieval in question-answer archives. In EMNLP, pages 855–865, 2017.
- [Filice et al., 2017] Simone Filice, Giovanni Da San Martino, and Alessandro Moschitti. Kelp at semeval-2017 task 3: Learning pairwise patterns in community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 326–333, 2017.
- [Glorot et al., 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
- [Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
- [Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
- [Jang et al., 2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- [Jeon et al., 2005] Jiwoon Jeon, W Bruce Croft, and Joon Ho Lee. Finding similar questions in large question and answer archives. In CIKM, pages 84–90. ACM, 2005.
- [Ji et al., 2012] Zongcheng Ji, Fei Xu, Bin Wang, and Ben He. Question-answer topic model for question retrieval in community question answering. In CIKM, pages 2471–2474. ACM, 2012.
- [Koreeda et al., 2017] Yuta Koreeda, Takuya Hashito, Yoshiki Niwa, Misa Sato, Toshihiko Yanase, Kenzo Kurotsuchi, and Kohsuke Yanai. bunji at semeval-2017 task 3: Combination of neural similarity features and comment plausibility features. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 353–359, 2017.
- [Le and Mikolov, 2014] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014.
- [Li et al., 2016] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
- [Lu and Li, 2013] Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In NIPS, pages 1367–1375, 2013.
- [Nakov et al., 2016] Preslav Nakov, Lluís Màrquez, and Francisco Guzmán. It takes three to tango: Triangulation approach to answer ranking in community question answering. In EMNLP, pages 1586–1597, 2016.
- [Nakov et al., 2017] Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, and Karin Verspoor. Semeval-2017 task 3: Community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 27–48, 2017.
- [Nandi et al., 2017] Titas Nandi, Chris Biemann, Seid Muhie Yimam, Deepak Gupta, Sarah Kohail, Asif Ekbal, and Pushpak Bhattacharyya. Iit-uhh at semeval-2017 task 3: Exploring multiple features for community question answering and implicit dialogue identification. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 90–97, 2017.
- [Parikh et al., 2016] Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933, 2016.
- [Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
- [Shen et al., 2015] Yikang Shen, Wenge Rong, Zhiwei Sun, Yuanxin Ouyang, and Zhang Xiong. Question/answer matching for cqa system via combining lexical and sequential information. In AAAI, pages 275–281, 2015.
- [Shen et al., 2017] Yikang Shen, Wenge Rong, Nan Jiang, Baolin Peng, Jie Tang, and Zhang Xiong. Word embedding based correlation model for question/answer matching. In AAAI, pages 3511–3517, 2017.
- [Singh, 2012] Amit Singh. Entity based q&a retrieval. In EMNLP, pages 1266–1277. Association for Computational Linguistics, 2012.
- [Stubbs, 2001] Michael Stubbs. Words and phrases: Corpus studies of lexical semantics. Blackwell publishers Oxford, 2001.
- [Surdeanu et al., 2008] Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. Learning to rank answers on large online qa collections. In ACL, volume 8, pages 719–727, 2008.
- [Sutton et al., 2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063, 2000.
- [Tan et al., 2015] Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. Lstm-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108, 2015.
- [Tian et al., 2017] Junfeng Tian, Zhiheng Zhou, Man Lan, and Yuanbin Wu. Ecnu at semeval-2017 task 1: Leverage kernel-based traditional nlp features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 191–197, 2017.
- [Wang and Jiang, 2016] Shuohang Wang and Jing Jiang. A compare-aggregate model for matching text sequences. arXiv preprint arXiv:1611.01747, 2016.
- [Wang et al., 2017a] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. Irgan: A minimax game for unifying generative and discriminative information retrieval models. 2017.
- [Wang et al., 2017b] Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814, 2017.
- [Xie et al., 2017] Yufei Xie, Maoquan Wang, Jing Ma, Jian Jiang, and Zhao Lu. Eica team at semeval-2017 task 3: Semantic and metadata-based features for community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 292–298, 2017.
- [Xue et al., 2008] Xiaobing Xue, Jiwoon Jeon, and W Bruce Croft. Retrieval models for question and answer archives. In ACM SIGIR, pages 475–482. ACM, 2008.
- [Yu et al., 2017] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
- [Zhang et al., 2017a] Sheng Zhang, Jiajun Cheng, Hui Wang, Xin Zhang, Pei Li, and Zhaoyun Ding. Furongwang at semeval-2017 task 3: Deep neural networks for selecting relevant answers in community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 320–325, 2017.
- [Zhang et al., 2017b] Xiaodong Zhang, Sujian Li, Lei Sha, and Houfeng Wang. Attentive interactive neural networks for answer selection in community question answering. In AAAI, pages 3525–3531, 2017.
- [Zhou et al., 2012] Tom Chao Zhou, Michael R Lyu, and Irwin King. A classification-based approach to question routing in community question answering. In WWW, pages 783–790. ACM, 2012.
- [Zhou et al., 2015] Guangyou Zhou, Tingting He, Jun Zhao, and Po Hu. Learning continuous word embedding with metadata for question retrieval in community question answering. In ACL, pages 250–259, 2015.
- [Zhou et al., 2016] Guangyou Zhou, Yin Zhou, Tingting He, and Wensheng Wu. Learning semantic representation with neural networks for community question answering retrieval. Knowledge-Based Systems, 93:75–83, 2016.