Adversarial Training for Community Question Answer Selection
Based on Multiscale Matching
Abstract
Community-based question answering (CQA) websites represent an important source of information. As a result, the problem of matching the most valuable answers to their corresponding questions has become an increasingly popular research topic. We frame this task as a binary (relevant/irrelevant) classification problem, and propose a Multiscale Matching model that inspects the correlation between words and n-grams (word-to-n-grams) at different levels of granularity, in addition to the word-to-word correlations used in most prior work. In this way, our model is able to capture rich context information conveyed in n-grams, and can therefore better differentiate good answers from bad ones. Furthermore, we present an adversarial training framework that iteratively generates challenging negative samples to fool the proposed classification model. This is completely different from previous methods, where negative samples are uniformly sampled from the dataset during the training process. The proposed method is evaluated on the SemEval 2017 and Yahoo Answers datasets and achieves state-of-the-art performance.
Xiao Yang, Miaosen Wang, Wei Wang, Madian Khabsa, Ahmed Awadallah
Pennsylvania State University, Google, Microsoft, Apple
xuy111@psu.edu, miaosen@outlook.com, wei.wang@microsoft.com, madian@apple.com, hassanam@microsoft.com
1 Introduction
Community-based question answering (CQA) websites such as Yahoo Answers and Quora are important sources of information and knowledge. They allow users to submit questions and answers covering a wide range of topics. These websites often organize such user-generated content in the form of a question followed by a list of candidate answers. Over time, a large number of crowdsourced question/answer pairs has accumulated, which can be leveraged to automatically answer a newly submitted question.
To fully exploit the information and knowledge stored in CQA systems, the community question answer selection task has received much attention recently. The CQA selection task aims at automatically retrieving archived answers that are relevant to a newly submitted question. Since many users tend to submit new questions rather than search existing ones [?], a large number of questions reoccur even though they may already be answered by previous content. Several challenges exist for this task, among which the lexical gap is a fundamental one that differentiates this task from other general-purpose information retrieval (IR) tasks. The term lexical gap describes the phenomenon that words or n-grams in questions may lead to causally related content in answers, rather than semantically related content such as synonymy. In other words, a sentence sharing many overlapping words with the question may not necessarily be a relevant answer.
To tackle the lexical gap problem, many methods explicitly model the correlations between text fragments in questions and answers, and frame this task as a binary classification problem. Under this setting, each question/answer pair is labeled as either relevant or non-relevant. Consequently, the CQA selection task can be approached by first predicting the relevance score for each candidate answer to a question, then re-ranking these answers to find the most appropriate one. The "matching-aggregating" framework [?; ?; ?] is a representative line of this research. It first represents each word by embeddings, then exhaustively compares each word in questions to each word in answers. The comparison results are later aggregated by a feed-forward neural network to make final predictions. Various strategies have been proposed for aggregating comparisons, such as the max-pooling method [?] and the attention method [?]. Our work also follows this framework. However, different from most prior work which only considers word-to-word comparisons, we also examine comparisons between words and n-grams of different lengths. The rationale is that the semantic meaning of a text fragment is not a simple combination of the meanings of individual words [?]. By considering word-to-n-gram comparisons, our model is able to capture semantic information at different levels of granularity and utilize it to assist classification. To obtain word-to-n-gram comparisons, we employ a deep convolutional neural network (CNN) to learn a hierarchical representation for each sentence. Neurons at higher levels compress information from larger contexts. Representations at different levels of one sentence are then compared with those from the other sentence. In this way, comparisons from multiple levels of granularity are inspected.
When treating CQA selection as a binary classification task, a practical issue is how to construct negative samples. While a training set often provides initial positive samples (questions and their answers labeled as relevant) and negative samples (questions and their answers labeled as non-relevant), many researchers augment negative samples by randomly coupling a question with an answer from a different question thread. The underlying assumption is that answers from other questions are unlikely to qualify as relevant to the current question. While such data augmentation can provide many more training samples, it also amplifies the problem of imbalanced class labels. Inspired by Generative Adversarial Nets (GANs) [Goodfellow et al., 2014], we present an adversarial training strategy which employs a generative model to produce a relatively small number of high-quality negative samples. While most negative samples augmented by uniform sampling can be easily classified (e.g. the topics of the answers are completely irrelevant to the questions), samples produced by a generative model are expected to be more challenging, and therefore more likely to fool the classification model. By alternately optimizing the generative model and the classification model, we finally obtain a more robust and accurate classifier.
Our contributions are summarized as follows:

We extend the current "matching-aggregating" framework for the CQA selection task by considering matchings at multiple levels of granularity. Prior work only examines word-to-word comparisons, and therefore cannot capture semantic information conveyed in larger contexts.

We present an adversarial training strategy which employs a generative model to produce challenging negative samples. By alternately optimizing the generative model and the classification model, we are able to improve prediction accuracy.

The proposed model is evaluated on the SemEval 2017 and Yahoo Answers datasets and achieves state-of-the-art performance.
2 Related Work
2.1 Community Question Answering
For an automatic community question answer selection system, two main tasks exist: (1) retrieving questions related to a newly submitted question [?; ?; ?; ?]; and (2) retrieving potentially relevant answers to a newly submitted question [?; ?; ?; ?; ?; ?; ?]. Successfully accomplishing the first task can assist the second, but it is not a required step. Techniques for CQA selection can be broadly categorized into three classes: (1) statistical translation models; (2) latent variable models; and (3) deep learning models.
Early work devoted great effort to statistical translation models, which take parallel corpora as input and learn correlations between words and phrases from one corpus and another. For example, [?; ?] use IBM translation model 1 to learn the translation probability between question and answer words. Later work improved upon them by considering phrase-level correlations [?] and entity-level correlations [?]. The proposed Multiscale Matching model shares a similar idea by incorporating word-to-n-gram comparisons; however, such comparisons are modeled by a deep neural network rather than a translation probability matrix.
Another line of work explores topic models for this task. Such approaches [?; ?; ?] usually learn the latent topics of questions and answers, under the assumption that a relevant answer should share a similar topic distribution with the question. Recently, these approaches have been combined with word embeddings [?; ?; ?] and translation models [?], which has led to further improvements in performance.
With the recent success of deep learning models in multiple natural language processing (NLP) tasks, researchers have started to explore deep models for CQA. [?] proposed a neural network to predict the pairwise ranking of two candidate answers. [?] trained two autoencoders, for questions and answers respectively, which share an intermediate semantic representation. Recently, a number of works have framed this task as a text classification problem and proposed several deep neural network based models. For example, [?] first encode sentences into sentence embeddings, then predict the relationship between questions and answers based on the learned embeddings. However, such approaches ignore the low-level interactions between words in sentences, so their performance is usually limited. Later, [?; ?; ?] proposed a matching-aggregating framework which first exhaustively compares words from one sentence to another, then aggregates the comparison results to make final predictions. Different matching strategies have been proposed, such as attentive matching [?], max-pooling matching [?], or a combination of various matching strategies [?]. The proposed Multiscale Matching model also follows this framework; however, we explore comparisons at multiple levels of granularity (word-to-word and word-to-n-grams).
2.2 Generative Adversarial Nets and NLP
Generative Adversarial Nets (GANs) [Goodfellow et al., 2014] were first proposed for generating samples from a continuous space such as images. A GAN consists of a generative model $G$ and a discriminative model $D$. $G$ aims to fit the real data distribution and attempts to map a random noise vector (e.g. a random sample from a Gaussian distribution) to a real sample (e.g. an image). In contrast, $D$ attempts to differentiate real samples from fake ones generated by $G$. During training, $G$ and $D$ are alternately optimized, forming a minimax game. A number of extensions to GANs have been proposed to achieve stable training and better visualization results for image generation.
The idea of adversarial training can also be applied to NLP tasks. Although such tasks often involve a discrete sampling process which is not differentiable, researchers have proposed several solutions such as policy gradient [?; ?; ?] and the Gumbel-Softmax trick [?]. [?] proposed SeqGAN to generate sequences of words from noise. [?] adopted adversarial training to improve the robustness of a dialog generation system. A work more relevant to our method is IRGAN [?], which applied adversarial training to multiple information retrieval tasks. However, IRGAN models the relationship between two documents solely based on the learned sentence embeddings, ignoring all low-level interactions. In contrast, we explore comparisons at multiple levels of granularity, and use the aggregated comparison results to measure relevance.
3 Method
In this section, we first formally define the task of community question answer selection by framing it as a binary classification problem, then describe our Multiscale Matching Model for classification. Finally, we present details about how to fit this model in an adversarial training framework.
Let $Q$ and $A$ be the input question and answer sentences of length $m$ and $n$, respectively. Let $s_\theta(Q, A)$ be a score function parameterized by $\theta$ that estimates the relevance between $Q$ and $A$. A higher value means that an answer is more relevant to the question. Given a question $Q$, its corresponding candidate answer set can be ranked based on the predicted relevance scores. The top ranked answers will be selected as the correct answers. Therefore the answer selection task can be accomplished by solving a binary classification problem.
3.1 Multiscale Matching Model
The goal of the proposed Multiscale Matching Model is to estimate the relevance score of a question/answer pair. Our model follows the "matching-aggregating" framework. Different from previous methods which only consider word-to-word matching, we also investigate the relation between words and n-grams of different lengths. In this way, the proposed model can leverage context information conveyed in n-grams, and therefore better differentiates good answers from bad ones. The architecture of the proposed model is illustrated in Figure 2.
Word and N-gram Embeddings For either a question or an answer sentence, we represent each word with a $d$-dimensional real-valued vector. In this work, we use pre-trained word embeddings from GloVe [Pennington et al., 2014], which have shown their effectiveness in multiple natural language processing tasks. For each sentence, our model learns a hierarchy of representations using a temporal convolutional neural network. Neurons at lower levels learn local semantic information, while neurons at higher levels compress the context semantic information within their receptive field. More formally, for a sentence $Q$ where each word is represented by its corresponding word embedding, a series of convolution blocks is applied:

(1) $Q^0 = Q$

(2) $Q^l = \mathrm{ConvBlock}(Q^{l-1}), \quad l = 1, \dots, L$

Here $Q^l$ is the feature map resulting from the $l$-th ($1 \le l \le L$) convolution block. A convolution block consists of a temporal convolution layer, followed by a batch normalization [Ioffe and Szegedy, 2015] layer, a rectified linear unit [Glorot et al., 2011] layer and a max-pooling layer. The kernel size of the temporal convolution layer is 3 and the number of output channels is 128. At the end, a hierarchy of feature representations $\{Q^0, Q^1, \dots, Q^L\}$ is learned, which compresses the semantic information at different levels of granularity. For example, $Q^0_i$ represents the information of the $i$-th word embedding, while $Q^2_i$ represents the context information of a 5-gram since the receptive field is 5.
A similar process is applied to the answer sentence $A$, resulting in another hierarchy of feature representations $\{A^0, A^1, \dots, A^L\}$.
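The hierarchical encoder described above can be sketched in NumPy. This is a simplified illustration, not the paper's implementation: one block here is a valid-padded temporal convolution (kernel size 3, 128 output channels), a ReLU, and a non-overlapping max-pool of width 2; batch normalization is omitted, and the exact receptive-field arithmetic depends on these assumed pooling choices.

```python
import numpy as np

def conv_block(x, kernels, pool=2):
    """One convolution block: temporal convolution (kernel size 3, valid
    padding), ReLU, then non-overlapping temporal max-pooling of width 2.
    x: (timesteps, in_channels); kernels: (3, in_channels, out_channels)."""
    t = x.shape[0]
    k = kernels.shape[0]
    # conv[i] contracts the length-k window x[i:i+k] against the kernels
    conv = np.stack([np.tensordot(x[i:i + k], kernels, axes=([0, 1], [0, 1]))
                     for i in range(t - k + 1)])
    conv = np.maximum(conv, 0.0)  # ReLU
    pooled = np.stack([conv[i:i + pool].max(axis=0)
                       for i in range(0, conv.shape[0] - pool + 1, pool)])
    return pooled

rng = np.random.default_rng(0)
sentence = rng.standard_normal((20, 50))        # 20 words, 50-dim embeddings
k1 = 0.1 * rng.standard_normal((3, 50, 128))
k2 = 0.1 * rng.standard_normal((3, 128, 128))
q1 = conv_block(sentence, k1)   # level-1 feature map, shape (9, 128)
q2 = conv_block(q1, k2)         # level-2 feature map, shape (3, 128)
```

Stacking blocks shortens the temporal axis while each remaining timestep summarizes a progressively larger span of the sentence, which is the hierarchy $\{Q^0, Q^1, \dots\}$ used below.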
Multiscale Matching and Aggregating For a specific pair of feature representations $Q^u$ and $A^v$, we can define a matching function $f(Q^u, A^v)$ to measure the relation between them. In the following, we first describe how to realize the matching function $f$, then define the score function based on the matching results $f(Q^u, A^v)$, abbreviated as $f_{uv}$.
Multiple ways exist to realize the matching function $f$, for example max-pooling matching [?] and attentive matching [?]. Here we adopt the max-pooling matching method due to its simplicity. First, we compare each pair of timesteps in $Q^u$ and $A^v$ using a comparison function $g$:

(3) $c_{ij} = g(Q^u_i, A^v_j)$

where the function $g$ is implemented by a two-layer feed-forward neural network. Note that both $Q^u_i$ and $A^v_j$ are $d$-dimensional vectors, and the output $c_{ij}$ is also a vector. For each timestep $i$ in $Q^u$, we aggregate the comparison results by max-pooling and obtain a single vector $q_i$:

(4) $q_i = \mathrm{Pooling}_{1 \le j \le n} \; c_{ij}$
where $\mathrm{Pooling}$ denotes elementwise max-pooling. In other words, only the maximum value of each dimension is retained. Figure 2 shows a diagram of this aggregation process.
Similarly, for each timestep $j$ in $A^v$, we can aggregate the comparison results and obtain:

(5) $a_j = \mathrm{Pooling}_{1 \le i \le m} \; c_{ij}$
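A minimal NumPy sketch of this max-pooling matching step follows. It assumes, purely for illustration, that the two-layer comparison network $g$ takes the concatenation of the two timestep vectors as input; the hidden width and output width are arbitrary choices here.

```python
import numpy as np

def two_layer_g(pairs, W1, b1, W2, b2):
    """Two-layer feed-forward comparison function g (Eq. 3), ReLU hidden layer."""
    return np.maximum(pairs @ W1 + b1, 0.0) @ W2 + b2

def maxpool_match(Qu, Av, params):
    """Eqs. (3)-(5): compare every pair of timesteps with g, then take the
    elementwise max over the other sentence's timesteps."""
    m, n = Qu.shape[0], Av.shape[0]
    # all (i, j) pairs as concatenated rows [q_i; a_j], shape (m*n, 2d)
    pairs = np.concatenate([np.repeat(Qu, n, axis=0),
                            np.tile(Av, (m, 1))], axis=1)
    c = two_layer_g(pairs, *params).reshape(m, n, -1)   # c_ij vectors
    q_vecs = c.max(axis=1)   # Eq. (4): one vector per question timestep
    a_vecs = c.max(axis=0)   # Eq. (5): one vector per answer timestep
    return q_vecs, a_vecs

rng = np.random.default_rng(1)
d, h = 8, 16
params = (0.1 * rng.standard_normal((2 * d, h)), np.zeros(h),
          0.1 * rng.standard_normal((h, d)), np.zeros(d))
q_vecs, a_vecs = maxpool_match(rng.standard_normal((5, d)),
                               rng.standard_normal((7, d)), params)
```

Each side thus keeps one comparison vector per timestep, ready for the summation and concatenation in Equations 6-8.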
Now that we have two sets of comparison vectors $\{q_i\}$ and $\{a_j\}$, we first aggregate over each set by summation:

(6) $\bar{q} = \sum_{i} q_i$

(7) $\bar{a} = \sum_{j} a_j$

then formulate the matching function as the concatenation of $\bar{q}$ and $\bar{a}$:

(8) $f(Q^u, A^v) = [\bar{q}; \bar{a}]$
Based on the defined matching function $f$, the score function can be formulated as:

(9) $s(Q, A) = h\big([f_{uv}]_{0 \le u, v \le L}\big)$

where $[f_{uv}]_{0 \le u, v \le L}$ denotes the concatenation of all possible matching results for $Q^u$ and $A^v$, and $h$ is a real-valued function. The function $h$ can be realized by a two-layer fully connected neural network. Equation 9 indicates that all possible word-to-word, word-to-n-gram and n-gram-to-n-gram matchings are considered. A simpler way is to formulate the score function as:

(10) $s(Q, A) = h\big([f_{0v}]_{0 \le v \le L};\, [f_{u0}]_{1 \le u \le L}\big)$

meaning that we only consider word-to-word and word-to-n-gram matchings, and ignore all n-gram-to-n-gram matchings. It is also clear that the approach in [?; ?; ?] is equivalent to formulating the score function as:

(11) $s(Q, A) = h(f_{00})$

indicating that only word-to-word matching is examined.
In this work, we adopt the second formulation (Equation 10), since the first (Equation 9) is computationally expensive and the third (Equation 11) cannot utilize context information conveyed in n-grams. Results in Section 4 show that the second formulation leads to a significant improvement compared with the third.
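The composition in Equation 10 can be sketched as follows. This is illustrative only: the pairwise comparison is reduced to an elementwise product and $h$ to a fixed linear readout, so the sketch demonstrates which level pairs are matched rather than learned behavior.

```python
import numpy as np

def match(Qu, Av):
    """Simplified matching f(Q^u, A^v): pairwise comparisons (here just an
    elementwise product), max-pooled per side (Eqs. 4-5), summed (Eqs. 6-7),
    and concatenated (Eq. 8)."""
    c = Qu[:, None, :] * Av[None, :, :]      # (m, n, d) pairwise comparisons
    q_bar = c.max(axis=1).sum(axis=0)        # Eq. (6)
    a_bar = c.max(axis=0).sum(axis=0)        # Eq. (7)
    return np.concatenate([q_bar, a_bar])    # Eq. (8), length 2d

def score(Q_levels, A_levels, w):
    """Eq. (10): word-to-word (f_00) plus word-to-n-gram (f_0v and f_u0)
    matchings, aggregated by a linear stand-in for h."""
    feats = [match(Q_levels[0], Av) for Av in A_levels]        # f_0v, v >= 0
    feats += [match(Qu, A_levels[0]) for Qu in Q_levels[1:]]   # f_u0, u >= 1
    return float(np.concatenate(feats) @ w)

rng = np.random.default_rng(2)
d, L = 8, 2
Q_levels = [rng.standard_normal((t, d)) for t in (10, 5, 2)]   # Q^0..Q^2
A_levels = [rng.standard_normal((t, d)) for t in (12, 6, 3)]   # A^0..A^2
w = rng.standard_normal(2 * d * (2 * L + 1))    # 2L+1 matchings of length 2d
s = score(Q_levels, A_levels, w)
```

With $L$ convolution levels, Equation 10 concatenates $2L+1$ matching vectors, versus $(L+1)^2$ for Equation 9, which is the computational saving mentioned above.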
3.2 Adversarial Training for Answer Selection
Generative Adversarial Nets Generative Adversarial Nets were first proposed by [Goodfellow et al., 2014]. A GAN consists of two "adversarial" models: a generative model (generator $G$) aiming at capturing the real data distribution $p_{\mathrm{data}}(x)$, and a discriminative model (discriminator $D$) that estimates the probability that a sample comes from the real training data rather than the generator. Both the generator and the discriminator can be implemented by nonlinear mapping functions, such as feed-forward neural networks.
The discriminator $D$ is optimized to maximize the probability of assigning the correct labels to both training samples and generated ones. The generator $G$, on the other hand, is optimized to maximize the probability of $D$ making a mistake, or equivalently to minimize $\log(1 - D(G(z)))$. Therefore, the overall objective can be summarized as:

(12) $\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

where the generative model is written as $G(z)$, with $z$ a noise sample drawn from $p_z$. During training, we alternately minimize and maximize the same objective function to learn the generator $G$ and the discriminator $D$, respectively.
Adversarial Training for Answer Selection Inspired by the idea of GANs, we propose an adversarial training framework which uses one Multiscale Matching model to produce high-quality adversarial samples and another Multiscale Matching model to differentiate positive samples from negative ones. In parallel to the terminology used in the GAN literature, we call these two models the generator and the discriminator, respectively.
In the context of answer selection, the generator aims to capture the real data distribution and generate (or select) relevant answers conditioned on the question sentence $Q$. In contrast, the discriminator attempts to distinguish between relevant and irrelevant answers given $Q$. In other words, the discriminator is simply a binary classifier. Formally, the objective function in Equation 12 can be rewritten as:

(13) $\min_G \max_D V(D, G) = \mathbb{E}_{A \sim p_{\mathrm{data}}(A|Q)}[\log D(A|Q)] + \mathbb{E}_{A' \sim p_G(A'|Q)}[\log(1 - D(A'|Q))]$
We now describe how to use the proposed Multiscale Matching model to build our discriminator and generator. Since the proposed model can be seen as a score function which measures how relevant an answer $A$ is to a question $Q$, we can directly feed the relevance score into a sigmoid function to build our discriminative model. The generator attempts to fit the underlying real data distribution, and based on that, randomly samples an answer from the whole answer set in order to fool the discriminator. To model this "sampling according to a probability distribution" process, we employ another Multiscale Matching model as a score function and evaluate it on every candidate answer. Afterwards, answers with high relevance scores will be sampled with high probabilities. In other words, we would like to select negative answers from the whole set which are most relevant to $Q$ according to the current generator.
Formally, given a set of candidate answers $\{A_1, \dots, A_N\}$ of a specific question $Q$, the discriminative model and the generative model are modeled by:

(14) $D(A|Q) = \sigma\big(s_{\theta_D}(Q, A)\big)$

(15) $p_G(A_i|Q) = \dfrac{\exp\big(s_{\theta_G}(Q, A_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(s_{\theta_G}(Q, A_j)/\tau\big)}$

with $\sigma$ being a sigmoid function and $\tau$ being a temperature hyperparameter. As the temperature approaches 0, samples from $p_G$ are more likely to come from answers with high relevance scores. $\tau$ is set to 2 in our experiments.
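Equations 14 and 15 translate directly into code. The sketch below uses hypothetical candidate scores; the sigmoid and the temperature softmax are standard constructions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_prob(score):
    """Eq. (14): D(A|Q) is the sigmoid of the relevance score."""
    return sigmoid(score)

def generator_distribution(scores, tau=2.0):
    """Eq. (15): softmax over candidate relevance scores with temperature
    tau; lower tau concentrates probability on high-scoring answers."""
    z = np.asarray(scores, dtype=float) / tau
    z -= z.max()                      # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

scores = np.array([3.0, 1.0, 0.0])    # hypothetical candidate scores
p_warm = generator_distribution(scores, tau=2.0)
p_cold = generator_distribution(scores, tau=0.1)
rng = np.random.default_rng(0)
negative = rng.choice(len(scores), p=p_warm)  # sample one adversarial answer
```

With $\tau = 0.1$ nearly all probability mass lands on the top-scoring answer, while $\tau = 2$ (the paper's setting) keeps the sampling noticeably softer.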
Ideally, the score function needs to be evaluated on each possible answer. However, the actual size of such an answer set can be very large, making the summation intractable. To address this issue, in practice we first uniformly sample an alternative answer set whose size is much smaller (e.g. 100) than the set of all possible answers. This set consists of answers from two sources: (1) labeled negative answers for question $Q$; and (2) answers from other questions. Since irrelevant answers far outnumber relevant ones, the resulting set is unlikely to contain any false negatives. Then we evaluate the score function on each answer in the set, and randomly sample adversarial answers according to the probability calculated in Equation 15.
The original GAN requires that both the generator and the discriminator be fully differentiable, so that a gradient-based optimization algorithm can be applied. However, this is not true in our case due to the random sampling step involved in the generator. A number of approaches have been proposed to tackle this problem, such as policy gradient [?; ?; ?] and the Gumbel-Softmax trick [?]. Here we adopt the policy gradient approach. As can be seen in Equation 13, the objective function for optimizing $G$ is expressed as minimizing the expectation of a function evaluated on samples from a probability distribution. Therefore, using the REINFORCE [?] algorithm, the gradient of $V$ with respect to $G$'s parameters $\theta_G$ can be derived as:

(16) $\nabla_{\theta_G} V = \nabla_{\theta_G} \mathbb{E}_{A \sim p_G(A|Q)}\big[\log(1 - D(A|Q))\big] = \mathbb{E}_{A \sim p_G(A|Q)}\big[\log(1 - D(A|Q))\, \nabla_{\theta_G} \log p_G(A|Q)\big] \approx \frac{1}{K} \sum_{k=1}^{K} \log(1 - D(A_k|Q))\, \nabla_{\theta_G} \log p_G(A_k|Q)$

where in the last step the expectation is approximated by sampling $A_k \sim p_G(A|Q)$. From a reinforcement learning point of view, the term $\log(1 - D(A_k|Q))$ is the received reward when the policy $p_G(\cdot|Q)$ takes the action of choosing answer $A_k$.
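A toy Monte-Carlo check of Equation 16 can be run with a linear softmax policy over a handful of candidate answers. Everything here (features, parameters, and discriminator outputs) is made up for illustration; for a softmax policy with linear scores $\theta \cdot x_a$, the score function is $\nabla_\theta \log p(a) = x_a - \mathbb{E}_p[x]$.

```python
import numpy as np

def policy(theta, X):
    """Softmax policy p_G(a) proportional to exp(theta . x_a)."""
    z = X @ theta
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def reinforce_grad(theta, X, rewards, rng, n_samples=20000):
    """Eq. (16): estimate the gradient of E_a[r(a)] by sampling actions and
    weighting grad log p(a) = x_a - E_p[x] by the reward r(a)."""
    p = policy(theta, X)
    mean_x = p @ X
    actions = rng.choice(len(p), size=n_samples, p=p)
    grads = rewards[actions, None] * (X[actions] - mean_x)
    return grads.mean(axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))           # 4 candidate answers, 3 features
theta = rng.standard_normal(3)
d_vals = np.array([0.9, 0.6, 0.3, 0.1])   # hypothetical D(A|Q) outputs
rewards = np.log(1.0 - d_vals)            # reward log(1 - D(A|Q))
g_mc = reinforce_grad(theta, X, rewards, rng)
# exact gradient for comparison: sum_a p(a) r(a) (x_a - E_p[x])
p = policy(theta, X)
g_exact = ((p * rewards)[:, None] * (X - p @ X)).sum(axis=0)
```

With enough samples the Monte-Carlo estimate converges to the exact expectation, which is what lets the generator be trained despite the non-differentiable sampling step.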
4 Experiments
In this section, we evaluate the proposed method on two benchmark datasets: SemEval 2017 and Yahoo Answers. Ablation experiments are conducted on both datasets to demonstrate the effectiveness of the proposed Multiscale Matching model and the adversarial training strategy.
4.1 Datasets and Evaluation
The SemEval 2017 dataset is used for SemEval 2017 Task 3 Subtask C (Question-External Comment Similarity). It contains 317 original questions from the Qatar Living website. Each question is associated with the first 10 related questions (retrieved by a search engine) and their corresponding top 10 answers appearing in the thread. As a result, each question is associated with 100 answers, and the ultimate goal is to re-rank these 100 answers according to their relevance to the question. The train/dev/test split is provided by the dataset.
Evaluation Metrics On the SemEval 2017 dataset, we use the official evaluation measure for the competition, which is mean average precision (MAP) calculated over the top 10 ranked answers. We also report mean reciprocal rank (MRR), another widely-used information retrieval measure. On the Yahoo Answers dataset, precision at the top rank position (Prec@1) is reported to compare with other methods.
For a single question, Average Precision (AP) is the average of the precisions obtained at the positions of relevant answers among the top $K$ ranked results. This value is then averaged over questions to yield MAP. Formally, MAP is defined as:

(17) $\mathrm{MAP} = \frac{1}{|\mathcal{Q}|} \sum_{q=1}^{|\mathcal{Q}|} \frac{1}{m_q} \sum_{k=1}^{K} P_q(k)\, \mathrm{rel}_q(k)$

where $m_q$ denotes the total number of relevant answers in the top $K$ results for question $q$, $P_q(k)$ is the precision of the top $k$ ranked answers, and $\mathrm{rel}_q(k)$ indicates whether the $k$-th ranked answer is relevant.
Mean Reciprocal Rank (MRR) is defined as:

(18) $\mathrm{MRR} = \frac{1}{|\mathcal{Q}|} \sum_{q=1}^{|\mathcal{Q}|} \frac{1}{\mathrm{rank}_q}$

with $\mathrm{rank}_q$ being the rank position of the first relevant answer for the $q$-th question.
Precision at the top rank position (Prec@1) evaluates the precision of choosing the best answer at the first rank position. It is first calculated for each question, then averaged over all questions:

(19) $\mathrm{Prec@1} = \frac{1}{|\mathcal{Q}|} \sum_{q=1}^{|\mathcal{Q}|} \mathrm{rel}_q(1)$

where $\mathrm{rel}_q(1)$ denotes whether the first ranked result is relevant to question $q$.
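The three measures can be sketched as plain Python over per-question relevance label lists (1 = relevant), ranked best-first; this follows the definitions above, including the normalization of AP by the number of relevant answers found in the top $K$.

```python
def average_precision(rels, k=10):
    """AP for one question; rels are 0/1 labels of the ranked list."""
    rels = rels[:k]
    hits, score = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            score += hits / i        # precision at each relevant position
    return score / hits if hits else 0.0

def mean_average_precision(all_rels, k=10):
    """Eq. (17): AP averaged over questions."""
    return sum(average_precision(r, k) for r in all_rels) / len(all_rels)

def mean_reciprocal_rank(all_rels):
    """Eq. (18): reciprocal rank of the first relevant answer, averaged."""
    total = 0.0
    for rels in all_rels:
        for i, r in enumerate(rels, start=1):
            if r:
                total += 1.0 / i
                break
    return total / len(all_rels)

def precision_at_1(all_rels):
    """Eq. (19): fraction of questions whose top-ranked answer is relevant."""
    return sum(1.0 for r in all_rels if r and r[0]) / len(all_rels)

# two toy questions: relevant answers at ranks {1, 3} and {2}
ranked = [[1, 0, 1], [0, 1, 0]]
map_score = mean_average_precision(ranked)   # ((1/1 + 2/3)/2 + 1/2) / 2 = 2/3
mrr_score = mean_reciprocal_rank(ranked)     # (1 + 1/2) / 2 = 0.75
p_at_1 = precision_at_1(ranked)              # 1/2
```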
Table 1: Results on the SemEval 2017 dataset.

Method  MAP  MRR
ECNU [?]  10.64  11.09
FuRongWang [?]  13.23  14.27
EICA [?]  13.48  16.04
KeLP [?]  14.35  16.07
bunji [?]  14.71  16.48
IIT-UHH [?]  15.46  18.14
Ours (single)  14.67  18.45
Ours (multi)  14.92  19.07
Ours (single+adversarial)  17.38  19.53
Ours (multi+adversarial)  17.79  19.73
4.2 Results on SemEval 2017
Table 1 summarizes the results of different methods on the SemEval 2017 dataset (numbers are extracted from [?]). For our methods, "single" denotes that we only consider word-to-word matchings as in Equation 11, while "multi" means that we consider both word-to-word and word-to-n-gram matchings as in Equation 10. "adversarial" means that we employ an additional generative model to produce challenging adversarial samples to fool the discriminative model during training. From the table we can see that Multiscale Matching consistently improves performance. With only a discriminative model and no adversarial training, the mean average precision increases from 14.67 to 14.92. With adversarial training, it increases from 17.38 to 17.79. Furthermore, with adversarial training, both our single-scale and multi-scale models outperform previous methods. This demonstrates the effectiveness of utilizing a generative model to produce challenging negative samples. Since in the SemEval 2017 dataset each question is associated with a relatively large number (100) of candidate answers, it is likely that many of them can be easily classified. It is therefore beneficial to have challenging adversarial samples during training in order to obtain a robust discriminator/classifier.
4.3 Examples
Table 2 shows an example to demonstrate the effectiveness of our proposed method. When asked the question "Is it possible to sponsor husband? Is possible, then can he able to work in Qatar?", the Ours (single) model ranks an irrelevant question sentence as the top-1 answer. While this model finds a sentence that shares several common words with the question, it fails to capture the question-answer structure. Each of the Ours (multi), Ours (single+adversarial) and Ours (multi+adversarial) models finds a declarative sentence as the top-1 answer, and the ground-truth labels are either potentially useful or relevant. All three models successfully capture the question-answer structure, for example by relating "is it …?" to "he can…". The relative quality of these three models is not clearly distinguishable, but it seems that the models with adversarial training prefer more detailed explanations such as "letter of no objection" and "approval". Meanwhile, we can see that the generators in adversarial training find irrelevant sentences (associated with questions other than $Q$) that share many common words (e.g. sponsor and wife) with the question. These sentences are more challenging negative samples than sentences simply chosen at random from other questions.
5 Conclusions
In this work, we framed the community question answer selection task as a binary classification problem, and proposed a Multiscale Matching model which is able to exploit context information. Different from previous methods in the matching-aggregating framework which only consider word-to-word matching, we examine matchings at multiple levels of granularity. Furthermore, inspired by Generative Adversarial Nets (GANs), we presented an adversarial training strategy which uses a generative model to produce challenging negative samples during training. By alternately optimizing the generative model and the discriminative model, we are able to obtain a more robust classifier. The proposed method was evaluated on two benchmark datasets, SemEval 2017 and Yahoo Answers, and achieved state-of-the-art performance. Future work will investigate the stability of GAN training, which remains an open research question, especially when discrete sampling is involved.
Table 2: An example question together with the top-1 answer selected by each model variant (single, multi, single+adversarial, multi+adversarial) and the adversarial samples drawn by each generator. (The answer texts are not recoverable from the source.)
References
 [Cai et al., 2011] Li Cai, Guangyou Zhou, Kang Liu, and Jun Zhao. Learning the latent topics for question retrieval in community qa. In IJCNLP, volume 11, pages 273–281, 2011.
 [Deepak et al., 2017] P Deepak, Dinesh Garg, and Shirish Shevade. Latent space embedding for retrieval in question-answer archives. In EMNLP, pages 855–865, 2017.
 [Filice et al., 2017] Simone Filice, Giovanni Da San Martino, and Alessandro Moschitti. KeLP at SemEval-2017 task 3: Learning pairwise patterns in community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 326–333, 2017.
 [Glorot et al., 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
 [Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
 [Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
 [Jang et al., 2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
 [Jeon et al., 2005] Jiwoon Jeon, W Bruce Croft, and Joon Ho Lee. Finding similar questions in large question and answer archives. In CIKM, pages 84–90. ACM, 2005.
 [Ji et al., 2012] Zongcheng Ji, Fei Xu, Bin Wang, and Ben He. Question-answer topic model for question retrieval in community question answering. In CIKM, pages 2471–2474. ACM, 2012.
 [Koreeda et al., 2017] Yuta Koreeda, Takuya Hashito, Yoshiki Niwa, Misa Sato, Toshihiko Yanase, Kenzo Kurotsuchi, and Kohsuke Yanai. bunji at SemEval-2017 task 3: Combination of neural similarity features and comment plausibility features. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 353–359, 2017.
 [Le and Mikolov, 2014] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014.
 [Li et al., 2016] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
 [Lu and Li, 2013] Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In NIPS, pages 1367–1375, 2013.
 [Nakov et al., 2016] Preslav Nakov, Lluís Màrquez, and Francisco Guzmán. It takes three to tango: Triangulation approach to answer ranking in community question answering. In EMNLP, pages 1586–1597, 2016.
 [Nakov et al., 2017] Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, and Karin Verspoor. SemEval-2017 task 3: Community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 27–48, 2017.
 [Nandi et al., 2017] Titas Nandi, Chris Biemann, Seid Muhie Yimam, Deepak Gupta, Sarah Kohail, Asif Ekbal, and Pushpak Bhattacharyya. IIT-UHH at SemEval-2017 task 3: Exploring multiple features for community question answering and implicit dialogue identification. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 90–97, 2017.
 [Parikh et al., 2016] Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933, 2016.
 [Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
 [Shen et al., 2015] Yikang Shen, Wenge Rong, Zhiwei Sun, Yuanxin Ouyang, and Zhang Xiong. Question/answer matching for cqa system via combining lexical and sequential information. In AAAI, pages 275–281, 2015.
 [Shen et al., 2017] Yikang Shen, Wenge Rong, Nan Jiang, Baolin Peng, Jie Tang, and Zhang Xiong. Word embedding based correlation model for question/answer matching. In AAAI, pages 3511–3517, 2017.
 [Singh, 2012] Amit Singh. Entity based q&a retrieval. In EMNLP, pages 1266–1277. Association for Computational Linguistics, 2012.
 [Stubbs, 2001] Michael Stubbs. Words and phrases: Corpus studies of lexical semantics. Blackwell publishers Oxford, 2001.
 [Surdeanu et al., 2008] Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. Learning to rank answers on large online qa collections. In ACL, volume 8, pages 719–727, 2008.
 [Sutton et al., 2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063, 2000.
 [Tan et al., 2015] Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. LSTM-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108, 2015.
 [Tian et al., 2017] Junfeng Tian, Zhiheng Zhou, Man Lan, and Yuanbin Wu. ECNU at SemEval-2017 task 1: Leverage kernel-based traditional NLP features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 191–197, 2017.
 [Wang and Jiang, 2016] Shuohang Wang and Jing Jiang. A compare-aggregate model for matching text sequences. arXiv preprint arXiv:1611.01747, 2016.
 [Wang et al., 2017a] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. IRGAN: A minimax game for unifying generative and discriminative information retrieval models. In SIGIR, 2017.
 [Wang et al., 2017b] Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814, 2017.
 [Xie et al., 2017] Yufei Xie, Maoquan Wang, Jing Ma, Jian Jiang, and Zhao Lu. EICA team at SemEval-2017 task 3: Semantic and metadata-based features for community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 292–298, 2017.
 [Xue et al., 2008] Xiaobing Xue, Jiwoon Jeon, and W Bruce Croft. Retrieval models for question and answer archives. In ACM SIGIR, pages 475–482. ACM, 2008.
 [Yu et al., 2017] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
 [Zhang et al., 2017a] Sheng Zhang, Jiajun Cheng, Hui Wang, Xin Zhang, Pei Li, and Zhaoyun Ding. FuRongWang at SemEval-2017 task 3: Deep neural networks for selecting relevant answers in community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 320–325, 2017.
 [Zhang et al., 2017b] Xiaodong Zhang, Sujian Li, Lei Sha, and Houfeng Wang. Attentive interactive neural networks for answer selection in community question answering. In AAAI, pages 3525–3531, 2017.
 [Zhou et al., 2012] Tom Chao Zhou, Michael R Lyu, and Irwin King. A classification-based approach to question routing in community question answering. In WWW, pages 783–790. ACM, 2012.
 [Zhou et al., 2015] Guangyou Zhou, Tingting He, Jun Zhao, and Po Hu. Learning continuous word embedding with metadata for question retrieval in community question answering. In ACL, pages 250–259, 2015.
 [Zhou et al., 2016] Guangyou Zhou, Yin Zhou, Tingting He, and Wensheng Wu. Learning semantic representation with neural networks for community question answering retrieval. Knowledge-Based Systems, 93:75–83, 2016.