Want a Good Answer? Ask a Good Question First!

Abstract

Community Question Answering (CQA) websites have become valuable repositories which host a massive volume of human knowledge. To maximize the utility of such knowledge, it is essential to evaluate the quality of an existing question or answer, especially soon after it is posted on the CQA website.

In this paper, we study the problem of inferring the quality of questions and answers through a case study of a software CQA (Stack Overflow). Our key finding is that the quality of an answer is strongly positively correlated with that of its question. Armed with this observation, we propose a family of algorithms to jointly predict the quality of questions and answers, for both quantifying numerical quality scores and differentiating the high-quality questions/answers from those of low quality. We conduct extensive experimental evaluations to demonstrate the effectiveness and efficiency of our methods.

Categories and Subject Descriptors: Data mining

1 Introduction

Community Question Answering (CQA) websites have become valuable repositories which host a massive volume of human knowledge. In addition to providing answers to the questioner, CQA websites now serve as knowledge bases for the searching and browsing of a much larger audience. For example, in the software forum Stack Overflow, programmers can post their programming questions, and others can propose answers to these questions. Such questions, as well as their associated answers, can be valuable and reusable for many other programmers who encounter similar problems. In fact, millions of programmers now use Stack Overflow to search for high-quality answers to their programming questions, and Stack Overflow has also become a knowledge base for people to learn programming skills by browsing high-quality questions and answers [16].

To maximize the utility of CQA websites, predicting the quality of existing questions and answers, especially soon after they are posted, becomes an essential task for both information producers and consumers. From the perspective of information producers (e.g., those who ask or answer questions), predicting the quality as early as possible could help questions that are potentially of high quality to attract more high-quality answers, e.g., by recommending these questions to experts. From the perspective of information consumers (e.g., those who search or browse questions and answers), it would be helpful to highlight the question-answer pairs of high quality (e.g., by displaying them more prominently on the website or making the search engine aware of their quality), so that users can easily discover them. In addition to recognizing high-quality posts, quality prediction is also useful to detect useless posts or even spam. For example, Stack Overflow has launched a contest to predict whether a question will be closed due to quality problems at the moment the question is posted.¹

To date, many efforts have been made to study the quality prediction problem in CQA websites. However, most of them treat questions and answers separately (see Section 6 for a review). For example, some work evaluates the question quality to recommend questions or to better match the user's query (e.g., [19]), while other work evaluates the answer quality to re-rank the answers returned by a search engine (e.g., [9]).

We conjecture that there might be a correlation between the quality of a question and that of its associated answers. Intuitively, an interesting question might attract more attention from potential answerers and thus has a better chance of receiving high-quality answers. On the other hand, it might be very difficult for a low-quality question to attract a high-quality answer due to, e.g., its poor language or lack of topical interest.

Starting from this conjecture, we study in this paper the relationship between the quality of questions and that of answers through a case study of a software CQA (Stack Overflow). Our key finding is that the quality of an answer is indeed strongly positively correlated with that of its question. Armed with this observation, we propose a family of algorithms (CoPs) to jointly predict the quality of questions and answers. The proposed CoPs algorithms can be applied to both the continuous case (e.g., to infer numerical quality scores) and the binary case (e.g., to differentiate the high-quality questions/answers from those of low quality). We conduct extensive experimental evaluations to demonstrate the effectiveness and efficiency of our methods. By collectively exploring the features, the predicted quality, and the link between them for both questions and answers simultaneously, our method achieves up to 13.13% improvement over the state-of-the-art methods wrt prediction error. In addition, our method scales linearly wrt the number of questions and answers.

In summary, this paper makes the following three main contributions:

  • Findings. We empirically study the post quality in Stack Overflow. To the best of our knowledge, we are the first to quantitatively validate the correlation between the question quality and its associated answer quality.

  • Algorithms. We propose a family of co-prediction algorithms (CoPs) to jointly predict the quality of questions and answers, for both quantifying numerical quality scores and differentiating the high-quality questions/answers from those of low quality.

  • Evaluations. Extensive experimental evaluations on the Stack Overflow dataset demonstrate the effectiveness and efficiency of our methods.

The rest of the paper is organized as follows. Section 2 empirically studies the Stack Overflow dataset. Sections 3 and 4 present the problem definitions and the proposed algorithms for the question/answer quality co-prediction problem, respectively. Section 5 presents the experimental results. Section 6 reviews related work. Section 7 discusses future work, and Section 8 concludes the paper.

2 Dataset Description and Analysis

In this section, we describe the Stack Overflow dataset and our pre-processing steps, followed by an empirical study of the question/answer quality in Stack Overflow.

2.1 Dataset Description and Pre-processing

Stack Overflow, which is highly ranked both among reference websites and among all websites², is a popular CQA website for software developers to ask and answer programming questions. The Stack Overflow dataset that we use spans from July 31, 2008 to August 31, 2011; it is officially published and publicly available³.

We performed the following pre-processing steps on the raw data. First, we cleaned the dataset by deleting records with missing important attributes (e.g., some answers do not indicate which questions they belong to). After this step, the dataset contains 756,695 users, 1,966,272 questions, 4,282,570 answers, 14,056,000 votes, 1,055,500 favorites, and 7,306,298 comments. Among the questions, 1,815,140 (92.3%) have at least one answer. We only use these answered questions because our focus is on the quality correlation between questions and their answers. Next, as our primary goal is to predict quality in the early stage of questions/answers, we only consider the information available in the first 24 hours after the question is posted; this step leaves 3,577,041 answers (i.e., 83.5% of the answers arrive in the first 24 hours). Our analysis is based on these 1,815,140 questions and their 3,577,041 associated answers.
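For illustration, the following Python sketch mirrors these filtering steps; the file and column names (`question_id`, `creation_date`, etc.) are hypothetical and do not reflect the official dump schema.

```python
import pandas as pd

# Hypothetical file/column names; the official dump schema differs.
questions = pd.read_csv("questions.csv", parse_dates=["creation_date"])
answers = pd.read_csv("answers.csv", parse_dates=["creation_date"])

# Step 1: drop answers whose question reference is missing.
answers = answers.dropna(subset=["question_id"])

# Step 2: keep only the questions with at least one answer.
questions = questions[questions["question_id"].isin(answers["question_id"])]

# Step 3: keep only the answers posted within 24 hours of their question.
merged = answers.merge(questions[["question_id", "creation_date"]],
                       on="question_id", suffixes=("", "_q"))
age = merged["creation_date"] - merged["creation_date_q"]
answers = merged[age <= pd.Timedelta(hours=24)]
```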

2.2 Empirical Findings

Table 1: Quality distribution over the divided bins.
Bins A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11
Q1 4.58% 23.56% 23.10% 22.60% 11.03% 5.99% 3.42% 4.70% 1.03% 0.01% 0.01%
Q2 1.97% 27.71% 28.12% 23.19% 9.80% 4.47% 2.19% 2.33% 0.21% 0.00% 0.00%
Q3 1.96% 18.18% 24.64% 26.35% 13.33% 6.85% 3.64% 4.53% 0.52% 0.00% 0.00%
Q4 2.05% 13.88% 19.98% 25.86% 15.26% 8.86% 5.36% 7.56% 1.18% 0.00% 0.00%
Q5 2.13% 11.80% 17.24% 23.35% 15.40% 10.03% 6.61% 11.23% 2.21% 0.00% 0.00%
Q6 2.26% 10.66% 15.64% 21.32% 14.59% 10.21% 7.03% 14.47% 3.82% 0.00% 0.00%
Q7 2.34% 9.75% 14.78% 20.07% 13.81% 9.72% 7.32% 16.56% 5.63% 0.01% 0.00%
Q8 2.43% 8.86% 13.13% 18.32% 12.58% 9.05% 6.78% 18.30% 10.54% 0.03% 0.00%
Q9 2.51% 6.94% 10.48% 15.41% 10.83% 7.86% 5.96% 16.81% 22.06% 1.06% 0.08%
Q10 2.26% 4.86% 7.49% 12.39% 8.78% 7.10% 5.14% 15.93% 26.10% 7.00% 2.94%
Q11 2.43% 3.21% 5.39% 8.73% 6.91% 6.18% 4.05% 14.97% 28.62% 8.03% 11.49%

Based on the above pre-processing steps, we now empirically study the post quality in Stack Overflow. We use the score of a post (i.e., the difference between the number of up-votes and the number of down-votes on the post) as the quality indicator, and first show the quality distribution of questions and answers in Fig. ?. The x-axis is in log scale, and the non-positive scores are omitted from the figures. As we can see, both question quality and answer quality follow a power-law distribution. This means that a large portion of posts in Stack Overflow receive little attention, e.g., fewer than 5 votes.

Next, we study the overall correlation between the quality of questions and that of their answers in Fig. ?, where the Pearson correlation coefficient is also computed. Since one question may have multiple answers, we compare the question quality against both the average answer quality and the maximum answer quality. As we can see from the figures, the quality of questions and that of their answers are strongly correlated in both cases.

To gain a deeper understanding of the quality correlation, we further divide questions and answers into several bins based on their scores. Specifically, questions and answers are each divided into 11 bins (i.e., Q1-Q11 and A1-A11, respectively), where each bin covers a successive range of quality scores, with Q1/A1 holding the negative scores and Q11/A11 the highest scores.

As mentioned before, a large portion of questions and answers are of low score (e.g., around 40% of question scores are 0, and around 25% of question scores are 1). Such a skewed distribution may mislead the quality prediction algorithms. For instance, if we simply predicted everything as low quality, we would obtain high prediction accuracy; however, such an algorithm would miss all the high-quality questions/answers. To address this issue, we take an additional pre-processing step to make the distribution more even by deleting some low-quality posts. Specifically, we randomly cut two-thirds of the 0-score questions and half of the 1-score questions; we then also randomly cut two-thirds of the 0-score answers and half of the 1-score answers, as well as the orphaned answers whose corresponding questions are deleted. After this step, we have 1,064,142 questions and 1,554,045 answers, and the quality distribution over the bins is shown in Fig. ?. As we can see, the distribution over the bins is now more balanced compared to the original power-law distribution. The positive quality correlation still exists in this dataset (e.g., the Pearson correlation coefficient remains strongly positive in the average case), and we will use this dataset as our final dataset in the experiments.
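A minimal sketch of this rebalancing step, assuming `questions` and `answers` are pandas DataFrames with (hypothetical) `score`, `id`, and `question_id` columns:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the random cut is reproducible

def downsample(posts, score, keep_fraction):
    """Randomly keep only `keep_fraction` of the posts with the given score."""
    hit = (posts["score"] == score).to_numpy()
    drop = hit & (rng.random(len(posts)) > keep_fraction)
    return posts[~drop]

# Keep 1/3 of the 0-score and 1/2 of the 1-score questions and answers.
questions = downsample(questions, score=0, keep_fraction=1 / 3)
questions = downsample(questions, score=1, keep_fraction=1 / 2)
answers = downsample(answers, score=0, keep_fraction=1 / 3)
answers = downsample(answers, score=1, keep_fraction=1 / 2)

# Drop orphaned answers whose questions were removed above.
answers = answers[answers["question_id"].isin(questions["id"])]
```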

Based on the divided bins, we next study the distribution of answers over each question bin. That is, for the questions in each bin Qi, we show the percentage of their associated answers over the 11 answer bins in Table 1. Several interesting observations can be made from the table. First, the majority of questions and answers lie on the diagonal, which supports the positive correlation between question quality and answer quality. Second, the table is very sparse in its top-right corner, which verifies that it is rare for a low-quality question to receive a high-quality answer. Finally, most high-quality answers are in the bottom-right corner of the table, meaning that their questions are also of high quality. However, there are a few very high-quality answers to questions with negative quality scores (i.e., questions in Q1). This is probably because these answers successfully identify the flaws of their questions and thus gain support from the community members.

Overall, we conclude from Table 1, as well as the above analysis, that a strong positive quality correlation between questions and their answers exists in Stack Overflow.

3 Quality Prediction: Problem Definitions

Table 2: Symbols.
Symbol  Definition and Description
X_q, X_a  the feature matrices for questions and answers
X̂_q, X̂_a  the feature matrices with transferred features
y_q, y_a  the real quality of questions and answers
w_q, w_a  the coefficients for the features
M  the association matrix of questions and answers
M̃  the row-normalized matrix of M
A'  the transpose of matrix A
A(i, j)  the element at the i-th row and j-th column of A
A(i, :)  the i-th row of matrix A
n_q, n_a  the numbers of questions and answers
t_max  the maximum iteration number
ξ  the threshold to terminate the iteration

In this section, we present the problem definitions for the quality prediction of questions/answers in CQA websites. Before that, let us first introduce the notations that will be used throughout the paper; they are listed in Table 2. Following convention, we use bold capital letters for matrices and bold lower-case letters for vectors. For example, we use X_q to denote the feature matrix for questions, where each row contains the feature vector of the corresponding question. We use M to denote the association matrix of questions and answers, where a non-zero element M(i, j) indicates that answer j belongs to question i. Similar to Matlab, we also denote the i-th row of matrix A as A(i, :), and the transpose of a matrix with a prime (e.g., A').

Based on the above notations, the prediction problem for questions can be defined as follows:

Problem 1 (Question Quality Prediction). Given the feature matrix X_q and the quality scores y_q of a set of labeled questions, estimate the quality scores of the unlabeled questions.

An analogous definition for answer quality prediction can be given, and we omit it for brevity. As mentioned in the previous section, we consider the answers as well as the other available information shortly after the question is posted. In this work, we fix the time window to 24 hours; therefore, the feature matrices X_q and X_a should only contain the information that is available in the first 24 hours after the question is posted.

One of our key findings in the previous section is the strong positive quality correlation between questions and their answers. As a result, we take the association matrix M into account, and predict the question and answer quality jointly. Based on the problem definition above, we now define our quality co-prediction problem as follows:

Problem 2 (Quality Co-Prediction). Given the feature matrices X_q and X_a, the association matrix M, and the quality scores of subsets of labeled questions and answers, jointly estimate the quality scores of the remaining questions and answers.

As we can see from the above problem definition, compared to the separate prediction problem, the only additional input of Problem 2 is the association matrix M. We will present the methods to solve Problem 1 and Problem 2 in the next section.

4 Quality Prediction: The Proposed Approach

In this section, we present our algorithms to solve the problems defined in the previous section. We start with the baseline algorithms which predict the question quality and answer quality separately. Then, we present our co-prediction methods (CoPs) to jointly predict the question and answer quality, for both the regression case (e.g., to infer numerical quality scores) and the classification case (e.g., to differentiate the high-quality questions/answers from those of low quality).

4.1 Preliminaries: Separate Methods

Here, we first introduce the separate method, which can be used for the quality prediction in Problem 1. Specifically, the problem can be formulated as the following optimization problem:

min_{w_q}  ℓ(X_q w_q, y_q) + λ ‖w_q‖²    (1)

where ℓ indicates the loss function, and the regularization term controlled by λ is added to avoid over-fitting.

In the above formulation, various loss functions can be used. For example, for the purpose of classification, we may choose the logistic function for ℓ; for regression, we can choose the square loss, i.e., ℓ(X_q w_q, y_q) = ‖X_q w_q − y_q‖². Notice that in the latter case, the formulation becomes the standard ridge regression problem, and it can be solved by the following closed-form solution:

w_q = (X_q' X_q + λ I)⁻¹ X_q' y_q    (2)

The analogous solution for answer quality prediction can be written similarly. We denote this as the Separate method, and will compare it with our co-prediction methods in the experiments.
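A minimal NumPy sketch of the Separate method with the square loss, using the ridge closed form in Eq. (2):

```python
import numpy as np

def separate_ridge(X, y, lam=0.01):
    """Ridge regression closed form: w = (X'X + lam * I)^(-1) X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Questions and answers are handled independently, e.g.:
#   wq = separate_ridge(Xq_train, yq_train); yq_hat = Xq_test @ wq
#   wa = separate_ridge(Xa_train, ya_train); ya_hat = Xa_test @ wa
```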

4.2 Co-Prediction: Basic Strategies

Next, before presenting our CoPs algorithms, we discuss the three basic strategies that we explore to leverage the observed quality correlation, together with the intuition behind each:

  • S1:

    The first strategy considers the feature space. Because the quality of questions and their answers is correlated, the features for question prediction are potentially useful for answer prediction. As a result, we transfer the question features to the answer side, using X̂_a = [X_a, M' X_q] as the new feature matrix for answers. Similarly, we transfer the answer features to the question side and incorporate them with X_q as X̂_q = [X_q, M̃ X_a]. Notice that we use the row-normalized matrix M̃ in the latter case; in other words, for a question with multiple answers, we take the average of the features from these answers (see the sketch at the end of this subsection).

  • S2:

    The second strategy takes into account the link between the feature space and the label space. That is, we can iteratively use the estimated question score ŷ_q as a feature for answer prediction, and the estimated answer score ŷ_a as a feature for question prediction. The intuition behind this strategy is that if the estimated quality score ŷ_q is close to the real quality score y_q, then ŷ_q would also be highly correlated with the associated answer quality y_a (and vice versa).

  • S3:

    The third strategy considers the label space. For a pair of a question and its answer, we can directly maximize the quality correlation or minimize the quality difference between them. In this work, we minimize the difference between the predicted score of a question and that of its answers. Namely, we require that X̂_q w_q ≈ M̃ X̂_a w_a, i.e., we constrain the predicted question score to be close to the average predicted score of its answers. We studied the maximum answer score as well and observed similar results; for brevity, we focus on the average case in this work.

With these three strategies, we will focus on the regression problem (i.e., to infer numerical quality scores) and the classification problem (i.e., to differentiate the high-quality questions/answers from those of low quality) in the next two subsections, respectively.
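To make S1 concrete (the sketch referenced above), a minimal NumPy version of the feature transfer; M is assumed to be a dense 0/1 association matrix of shape (n_q, n_a), whereas in practice it would be sparse:

```python
import numpy as np

def transfer_features(Xq, Xa, M):
    """S1: augment each answer with its question's features, and each
    question with the average of its answers' features."""
    # Row-normalize M so each question row gives averaging weights.
    Mn = M / M.sum(axis=1, keepdims=True)
    Xa_hat = np.hstack([Xa, M.T @ Xq])  # answer gets its question's features
    Xq_hat = np.hstack([Xq, Mn @ Xa])   # question gets its answers' average
    return Xq_hat, Xa_hat, Mn
```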

4.3 Proposed Approach for Regression

Here, we consider the continuous case of the co-prediction problem, where we want to infer numerical scores for each question and its associated answers. We propose CoPs-Iter, which is based on S1 and S2. The detail of the proposed CoPs-Iter is shown in Alg. ?. As we can see in the algorithm, CoPs-Iter first incorporates the transferred features, estimates the answer quality ŷ_a, and then starts the iteration. In each iteration, CoPs-Iter alternately uses ŷ_a and ŷ_q as features for the question quality prediction and the answer quality prediction, respectively. We stop the iteration when the norm between successive estimates of both ŷ_q and ŷ_a falls below our threshold ξ, or the maximum iteration number t_max is reached.
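A minimal sketch of CoPs-Iter under the above description, reusing `separate_ridge` from Section 4.1 as the base learner; `t_max` and `eps` play the roles of t_max and ξ in Table 2:

```python
import numpy as np

def cops_iter(Xq_hat, Xa_hat, yq, ya, M, Mn,
              lam=0.01, t_max=100, eps=1e-6):
    """CoPs-Iter (sketch): alternately feed each side's estimated scores
    back as an extra feature for the other side, until convergence."""
    ya_hat = Xa_hat @ separate_ridge(Xa_hat, ya, lam)  # initial estimate
    yq_hat = np.zeros(len(yq))
    for _ in range(t_max):
        # S2: average estimated answer score as an extra question feature.
        Fq = np.hstack([Xq_hat, (Mn @ ya_hat)[:, None]])
        yq_new = Fq @ separate_ridge(Fq, yq, lam)
        # S2: the question's estimated score as an extra answer feature.
        Fa = np.hstack([Xa_hat, (M.T @ yq_new)[:, None]])
        ya_new = Fa @ separate_ridge(Fa, ya, lam)
        done = (np.linalg.norm(yq_new - yq_hat) < eps and
                np.linalg.norm(ya_new - ya_hat) < eps)
        yq_hat, ya_hat = yq_new, ya_new
        if done:
            break
    return yq_hat, ya_hat
```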

4.4 Proposed Approach for Classification

Here, we consider the binary case of the co-prediction problem, where we want to differentiate the high-quality questions/answers from those of low quality. We first show the general procedure for solving the co-prediction problem, followed by a special case, and then briefly analyze the effectiveness and efficiency of the algorithms.

The General Procedure

The proposed approach is based on S1 and S3. To be specific, we propose a new optimization formulation for the question and answer quality co-prediction problem. That is, after transferring the features, we add the constraint from S3 as an additional term in the optimization formulation:

min_{w_q, w_a}  ℓ₁(X̂_q w_q, y_q) + ℓ₁(X̂_a w_a, y_a) + α · ℓ₂(X̂_q w_q, M̃ X̂_a w_a) + λ (‖w_q‖² + ‖w_a‖²)    (3)

where ℓ₂ indicates the loss function of the additional quality correlation term, and α is a parameter that controls the importance of this term. Compared to CoPs-Iter, an advantage of the above formulation is that it can deal with the sparsity problem when some of the quality scores are not available. For example, in the extreme case when there are no answer scores available, Eq. (3) can still predict the scores of new answers due to the quality correlation term. We will experimentally evaluate this in Section 5.

Next, let us further discuss the loss functions ℓ₁ and ℓ₂ in Eq. (3). In addition to the square loss mentioned before, other loss functions can be applied under our classification setting. The intuition is that while the square loss tends to minimize the difference between the real quality score and the estimated score, the consistency between the real quality label and the estimated label is also important for the classification task. Therefore, we divide the posts into two classes, high quality with label +1 and low quality with label -1, and then consider the sigmoid loss function

ℓ(z, y) = 1 / (1 + exp(y · z))

where z is the predicted score and y ∈ {+1, -1} is the real label, so that the loss approaches 0 when the prediction agrees with the label and 1 otherwise.
By setting each of ℓ₁ and ℓ₂ to either the square loss or the sigmoid loss, we derive four variants of Eq. (3), as shown in Table 3.

Table 3: The four variants derived from Eq. (3).
Variant ℓ₁ ℓ₂
CoPs-QQ square loss square loss
CoPs-QG square loss sigmoid loss
CoPs-GG sigmoid loss sigmoid loss
CoPs-GQ sigmoid loss square loss

Finally, Alg. ? shows the skeleton of the general algorithm to solve the four variants in Table 3. The details of steps 6-7 with the various loss functions are presented in the appendix for completeness. As we can see from the algorithm skeleton, after initializing w_q and w_a, we adopt the batch gradient descent method to iteratively update the coefficients. We stop the iteration when the norm between successive estimates of both w_q and w_a falls below our threshold ξ, or the maximum iteration number t_max is reached.
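For illustration, a minimal sketch of this skeleton; `grad_fn` stands in for the partial-derivative routine of the chosen loss pair (e.g., a closure over the square-loss gradients sketched in the appendix), and `eta` is an assumed fixed step size:

```python
import numpy as np

def cops_fit(Xq_hat, Xa_hat, grad_fn, eta=0.01, t_max=100, eps=1e-6):
    """Batch gradient descent skeleton for the four CoPs variants (sketch)."""
    wq = np.zeros(Xq_hat.shape[1])  # initialize the coefficients
    wa = np.zeros(Xa_hat.shape[1])
    for _ in range(t_max):
        dwq, dwa = grad_fn(wq, wa)  # steps 6-7: partial derivatives
        wq_new, wa_new = wq - eta * dwq, wa - eta * dwa
        done = (np.linalg.norm(wq_new - wq) < eps and
                np.linalg.norm(wa_new - wa) < eps)
        wq, wa = wq_new, wa_new
        if done:  # both coefficient vectors converged, stop early
            break
    return wq, wa
```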

A Special Case

The optimization problem in Eq. (3) is non-convex in general. However, if we set both ℓ₁ and ℓ₂ to the square loss, the optimization problem becomes convex, and we can obtain the closed-form solution for CoPs-QQ by setting the gradients with respect to w_q and w_a to zero and solving the resulting block linear system.
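As an illustration, a NumPy sketch of this closed form under the squared-loss objective assumed above; `Xq` and `Xa` denote the transferred feature matrices X̂_q and X̂_a, and `Mn` the row-normalized M̃:

```python
import numpy as np

def cops_qq_closed_form(Xq, Xa, yq, ya, Mn, alpha=1.0, lam=0.01):
    """Solve the zero-gradient conditions of the all-square-loss objective
    as one block linear system in (w_q, w_a) (assumed objective form)."""
    Ma = Mn @ Xa                      # averaged answer design matrix
    dq, da = Xq.shape[1], Xa.shape[1]
    A = np.zeros((dq + da, dq + da))
    A[:dq, :dq] = (1 + alpha) * (Xq.T @ Xq) + lam * np.eye(dq)
    A[:dq, dq:] = -alpha * (Xq.T @ Ma)
    A[dq:, :dq] = -alpha * (Ma.T @ Xq)
    A[dq:, dq:] = Xa.T @ Xa + alpha * (Ma.T @ Ma) + lam * np.eye(da)
    b = np.concatenate([Xq.T @ yq, Xa.T @ ya])
    w = np.linalg.solve(A, b)
    return w[:dq], w[dq:]
```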

Algorithm Analysis

Here, we briefly analyze the effectiveness and efficiency of our algorithms for the optimization problem in Eq. (3).

The effectiveness of the proposed algorithms can be summarized in Lemma ?, which says that the algorithms in Alg. ? find a local minimum, while the closed-form solution of CoPs-QQ finds the global minimum.

Omitted for brevity.

The time complexity of the proposed CoPs is summarized in Lemma ?, which says that CoPs scales linearly wrt the total number of questions and answers (n_q + n_a).

Omitted for brevity.

5 Experiments

In this section, we present the experimental evaluations. All the experiments are designed to answer the following questions:

  • Effectiveness: How accurate are the proposed methods for question and answer quality prediction?

  • Efficiency: How do the proposed methods scale?

5.1 Experiment Setup

We first describe the features we use for question quality prediction and answer quality prediction; the selected features are shown in Table 4 and Table 5, respectively. Notice that our goal is not to find the most useful features for separate question/answer quality prediction, but to show that the quality correlation between questions and their answers can indeed improve the prediction accuracy. Therefore, as we can see from the tables, we choose some commonly used features in Q&A quality prediction. For example, questioners' reputation could reflect their ability to post a high-quality question, and questioners may become more professional as they post more questions. The number of comments and the length of the post are also widely used by existing methods. Notice that we do not include the vote information in the selected features because the quality score is defined as the difference between up-votes and down-votes in Stack Overflow.

Table 4: Selected features for question quality prediction.
Feature Description
Questioner’s reputation when the question is posted
# of Questioner’s previous questions when the question is posted
# of answers received in 24 hours after the question is posted
# of favorites received in 24 hours after the question is posted
# of comments received in 24 hours after the question is posted
The length of the question
The length of the title
Table 5: Selected features for answer quality prediction.
Feature Description
Answerer’s reputation when the answer is posted
# of Answerer’s previous answers when the answer is posted
# of comments received in 24 hours after its question is posted
The length of the answer
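For illustration, the question feature vector of Table 4 could be assembled as follows; the field names are hypothetical stand-ins for the dump attributes:

```python
import numpy as np

def question_features(q):
    """The 7 question features of Table 4 (hypothetical field names)."""
    return np.array([
        q["asker_reputation"],      # questioner's reputation at post time
        q["asker_prev_questions"],  # # of the questioner's previous questions
        q["n_answers_24h"],         # # of answers within 24 hours
        q["n_favorites_24h"],       # # of favorites within 24 hours
        q["n_comments_24h"],        # # of comments within 24 hours
        len(q["body"]),             # length of the question
        len(q["title"]),            # length of the title
    ], dtype=float)
```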

Now, we already have the input matrices X_q and X_a based on the selected features, and the next input is the quality score. In the regression setting (i.e., to infer continuous numerical quality scores), we normalize the quality scores between 0 and 1. In the classification setting (i.e., to predict the binary labels for high-quality and low-quality questions/answers), we classify the posts whose quality score is equal to or smaller than 0 as low-quality posts, and those whose quality score is equal to or larger than 5 as high-quality posts.
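A minimal sketch of this label construction, assuming `scores` is a NumPy array of raw post scores:

```python
import numpy as np

def regression_targets(scores):
    """Min-max normalize raw scores into [0, 1] for the regression setting."""
    s = scores.astype(float)
    return (s - s.min()) / (s.max() - s.min())

def classification_labels(scores):
    """Binary labels: <= 0 is low quality (-1), >= 5 is high quality (+1);
    posts in between are left out (0 here)."""
    labels = np.zeros(len(scores), dtype=int)
    labels[scores <= 0] = -1
    labels[scores >= 5] = +1
    return labels
```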

In terms of evaluation metrics, we adopt the root mean square error (RMSE) in the continuous case and the prediction error in the binary case, both measured between the real quality and the estimated quality. Specifically, we evaluate the effectiveness of CoPs-Iter in the regression setting, and study the four variants in Table 3 (i.e., CoPs-QQ, CoPs-QG, CoPs-GG, and CoPs-GQ) in the classification setting. Unless otherwise stated, for all the results reported here, we fix α = 1 and λ = 0.01 (see Section 5.4 for the sensitivity study).
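In sketch form, the two metrics read:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error for the regression setting."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def prediction_error(labels_true, labels_pred):
    """Fraction of misclassified posts (labels in {+1, -1})."""
    return np.mean(labels_true != labels_pred)
```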

In addition to the Separate method outlined in Section 4.1, we are not aware of any existing work for joint prediction of question and answer quality, with the only exception of the CQA-MR method [5]. CQA-MR aims to simultaneously classify the user reputation and the quality of answers and questions in a mutually reinforced manner. We would like to point out that although both CQA-MR and our CoPs aim to improve classification accuracy by co-prediction, there are two important differences. First, while CQA-MR implements the co-prediction through the label space (by propagating the labels through the user-question-answer graph), our CoPs does so through both the label space and the feature space, as well as the link between them (see S1-S3 in Section 4.2). Second, our CoPs finds either a local or a global minimum of Eq. (3) (depending on the specific loss functions). In contrast, CQA-MR alternates between propagating labels and maximizing the corresponding conditional likelihood. Consequently, it is not clear what overall cost function CQA-MR aims to optimize and whether or not the overall procedure converges. As we will show below, these subtle differences lead to a significant performance improvement for our CoPs.

5.2 Effectiveness Results

For effectiveness, we choose 1% of the questions and their associated answers as the training set, and use the rest as the test set, unless otherwise stated. All the effectiveness results reported are the average of 10 experiments.

Figure 1: The RMSE results of CoPs-Iter in the regression setting. CoPs-Iter outperforms the Separate method.

(A) The Effectiveness of CoPs in Regression Setting. We first compare the RMSE results of CoPs-Iter with those of the Separate method, shown in Figure 1. As we can see from the left part of the figure, although the proposed CoPs-Iter method is better wrt RMSE, the improvement is very limited. This is probably due to the fact that a large portion of the normalized quality scores are still concentrated in a narrow low-score range. As a result, we further apply the CoPs-Iter method on the low-quality and high-quality posts only, and the results are also shown in Figure 1 (denoted as 'HL posts only'). In this case, CoPs-Iter improves over the Separate method by 4.50% in question prediction and 2.47% in answer prediction.

Table 6: The prediction error results of CoPs-QQ, CoPs-QG, CoPs-GG, and CoPs-GQ in the classification setting. Our methods outperform the Separate method and the CQA-MR method.
Questions Answers
Separate 0.2192 0.3396
CQA-MR 0.2360 0.3416
CoPs-QQ 0.2048 0.3010
CoPs-QG 0.2101 0.2941
CoPs-GG 0.2111 0.2939
CoPs-GQ 0.2029 0.2950

(B) The Effectiveness of CoPs in Classification Setting. In the classification experiment, we first compare our four variants derived from Eq. (3) with the Separate method and the CQA-MR method. We still use only 1% of the questions and their associated answers as the training set, and show the results in Table 6. As we can see, all four of our methods outperform the compared methods. For example, CoPs-GQ improves over the Separate method by 7.44% and 13.13% wrt the prediction error of questions and answers, respectively.

Next, we vary the percentage of questions and answers in the training set, and show how the prediction error changes in Fig. ?. As we can see, overall, our four methods are still better than the compared methods. In addition, even when the available labels are very sparse (i.e., only 0.01% of the questions and their associated answers are used as the training set), our methods can still predict the quality of questions/answers accurately. As to the comparison among the four proposed methods, CoPs-QQ and CoPs-GQ are slightly better than the other two for question quality prediction; for answer quality prediction, all four methods are close to each other wrt prediction error.

Table 7: Performance gain wrt prediction error of the CoPs-QQ method. Both the transferred features and the quality correlation term in Eq. (3) help to lower the prediction error.
Questions Answers
Separate 0.2266 0.3531
Separate + transferred features 0.2172 0.3188
CoPs-QQ 0.2096 0.3149

Next, we take our CoPs-QQ method as an example and show where the performance gain comes from in Table 7, under a fixed training setting. As we can see from the table, both the transferred features and the quality correlation term in Eq. (3) help to lower the prediction error. For example, in question quality prediction, the transferred features lower the prediction error by 4.15%, and the quality correlation term further lowers it by 3.50%.

(C) The Effectiveness of CoPs to Deal With Sparsity. As mentioned before, our methods are beneficial when the available quality scores are sparse. We have already shown that our methods can predict the quality accurately when only a small set of questions and their answers are labeled. To further validate the capability of CoPs to deal with the sparsity problem, we consider the case when the labels of questions and answers are not simultaneously available. That is, we first fix 10% of the questions as well as their associated answers as the training set, and then delete the available answer scores. We compare the results with the Separate method in Fig. ?. As we can see, our four methods are all better than the Separate method in terms of prediction error. For example, in the extreme case when there are no answer scores available (i.e., 0% answers), the Separate method can only guess the labels randomly, while our CoPs-QQ method can still predict the quality with a prediction error below 0.35. As to the comparison among our four methods, CoPs-QQ performs better than the others when there are no answer scores available, and the four methods become close to each other wrt prediction error as the number of available answer scores increases. In Fig. ?, we also show the corresponding prediction error for questions. As we can see, in all cases, our methods are better than or at least close to the Separate method in terms of prediction error.

Similarly, we also choose 10% of questions and their associated answers as the training set, and then delete the available question scores. The results are shown in Fig. ?. Again, as we can see from Fig. ?, our methods are better wrt prediction error when the available question labels are sparse. The corresponding answer quality prediction results are also shown in Fig. ?, and our four methods are substantially better than the Separate method wrt prediction error.

5.3 Efficiency Results

Figure 2: The wall-clock time of our methods. Our methods scale linearly wrt the data size (n_q + n_a).

Here, we also study the efficiency of CoPs by recording the wall-clock time of the training step. First, we vary the size of the training set, and plot the training time against the number of questions and answers in the training set. The results of our four methods (i.e., CoPs-QQ, CoPs-QG, CoPs-GG, and CoPs-GQ) are shown in Figure 2. As we can see, all four methods scale linearly wrt the size of the training set, which is consistent with the algorithm analysis in Lemma ?. Notice that the CoPs-QQ method is much faster than the other three methods because it admits a closed-form solution.

Next, we consider combining effectiveness with efficiency, and define the utility ratio as the prediction accuracy divided by the wall-clock training time. This utility ratio reflects the prediction accuracy achieved within a given time period, and the higher the utility ratio, the better the method. We compare the utility ratio of the CoPs-QQ method with that of the CQA-MR method for both question prediction and answer prediction in Fig. ?. As we can see, CoPs-QQ performs better in both question prediction and answer prediction. Overall, among the four variants of CoPs, CoPs-QQ has similar accuracy to the others while running faster. Based on these efficiency results, together with the effectiveness results, we recommend CoPs-QQ in practice.

5.4 Parameter Sensitivity

Finally, we test the parameter sensitivity of CoPs. There are two parameters in Eq. (3): α, which indicates the importance of the quality correlation term, and λ, which controls the amount of regularization. The results are shown in Fig. ?. As we can see, the prediction error of our methods stays stable when α is no larger than 1, which means that we may still obtain good prediction results when we put the same emphasis on the quality correlation term as on the first two terms of the equation. For simplicity, we fix α = 1 in our experiments. As for λ, the prediction accuracy is stable over a large range, and we fix it to 0.01. Based on these results, we conclude that our methods are robust wrt the two parameters over a wide range.

6 Related Work

In this section, we briefly review related work including question/answer quality prediction, empirical studies on Stack Overflow, etc.

Question/Answer Quality Prediction:

Question answering websites have become valuable knowledge bases which receive millions of visits and queries each day. As a result, several methods have been proposed to identify relevant questions for a given query (e.g., [24]). To further improve the usefulness of the returned questions, the quality of these questions should also be considered. For example, Song et al. [19] define question quality as the likelihood that a question is repeatedly asked by people, and evaluate this measurement in the setting of question search. In addition to re-ranking the returned questions for a given query, question quality can be used to promote questions to prominent places so that users can easily discover them [20].

Similar to question quality prediction, answer quality prediction could also be used to directly identify high-quality answers for a given user query. For example, Jeon et al. [9] and Suryanto et al. [21] use human annotators to label the quality of the answers, and evaluate the usefulness of answer quality by incorporating it to improve retrieval performance.

The main focus of the above work is to employ the quality of questions/answers to improve the performance of information retrieval. In contrast, we observe the positive quality correlation between questions and their answers, and propose to jointly predict the quality of questions and answers by leveraging this correlation. As a result, on one hand, we can predict the quality of questions/answers more accurately; on the other hand, we can better exploit question-answer pairs as a whole.

Agichtein et al. [2] and Bian et al. [5] also aim to predict the quality of both questions and answers. Their focus is to tackle the sparsity problem where only a small number of questions/answers are labeled. As shown in our experiments, our method can also deal with the sparsity problem by leveraging the quality correlation between questions and their answers. In the methodology aspect, Agichtein et al. [2] still treat question quality prediction and answer quality prediction as separate problems. For the method proposed by Bian et al. [5], as pointed out in Section 5.1, there are two important differences between their method (CQA-MR) and our CoPs, which lead to a significant performance difference (see Fig. ? for an example).

Quality Measurement:

There are several types of measurement to quantify the quality of questions and answers. First, Liu et al. [13] propose to measure questioner satisfaction, i.e., which answer the questioner will probably choose as the accepted answer. This problem has since been followed up by several researchers [18]. However, accepted answers are not necessarily the highest-quality answers due to timing and subjectivity issues.

To overcome the subjectivity issue of questioner satisfaction, many proposals resort to quality measures derived from long-term community voting or human labeling. For example, Harper et al. [7] conduct a field study on several question answering websites to seek the reasons for high-quality answers. They use human labels as the quality indicator and find that factors such as community effect and payment play important roles in answer quality, while rhetorical strategy and question type have little effect. Nasehi et al. [15] study the answer quality of code examples in Stack Overflow, using the community-voted score of an answer as the quality measure. Similar to their work, we also use the voted score, i.e., the difference between the number of up-votes and down-votes from the community, as the quality measure.

Empirical Study on Stack Overflow:

Due to the great value of CQA in helping software development, many empirical studies have been conducted on Stack Overflow. For example, Treude et al. [23] investigate the website to identify which types of questions are frequently asked and answered by programmers. Parnin et al. [17] study whether Stack Overflow can be used as a substitute for API documentation. Barua et al. [4] analyze the text content of the posts in Stack Overflow to discover the current hot topics that software developers are discussing. Mamykina et al. [14] try to identify the successful design choices of Stack Overflow so that the lessons can be reused for other applications; one of the main reasons they found is the tight involvement of founders and moderators in the community. Our work could automatically support such moderation by identifying the high-quality and low-quality posts at an early stage. In summary, different from the existing empirical studies, our focus is on the quality of questions and answers in Stack Overflow, and such post quality is essential for the reuse of CQA knowledge.

Other Related Work:

There is some other recent work that is potentially related to ours. For example, Tausczik and Pennebaker [22] empirically study the correlation between user reputation and post quality in MathOverflow, and find that both offline and online reputation points are good predictors of post quality. Gottipati et al. [6] focus on how to find relevant answers in software forums when there could be many answers for a single question. Kumar et al. [10] reveal the mutual effect between question and answer dynamics in Stack Overflow, and prove that a certain equilibrium can be achieved from a theoretical perspective. Liu et al. [12] propose the problem of CQA searcher satisfaction, i.e., whether an answer in a CQA will satisfy an information searcher arriving via a search engine. They divide the searcher satisfaction problem into three subproblems (i.e., query clarity, query-question match, and answer quality) and conclude that more intelligent prediction of answer quality is still needed. How to route the right question to the right answerer [25], and how to predict the long-lasting value (i.e., the page views) of a question and its answers [3], have also been studied by several researchers.

7 Discussions and Future Work

There are several issues we need to clarify and discuss in this work.

First, although to some extent it is intuitive that the quality of questions and that of their answers are correlated, the empirical verification of this intuition is missing. This is probably due to the fact that most CQA websites do not contain suitable quality measures for both questions and answers. As a result, the state-of-the-art methods mainly use human annotators to label a small set of questions/answers, and then train a model based on these labels to predict the quality of other questions/answers. We would like to point out that, even in those cases where human annotation is used, the quality correlation between questions and their answers could still be used to improve the prediction accuracy. In addition, we have shown in our experiments that our method can tackle the sparsity problem when only a small number of human-annotated labels are available. Therefore, our method is applicable and helpful in many quality prediction tasks in CQA websites.

Second, in the context of Stack Overflow, predicting the quality of question-answer pairs in their early stage is also important. For example, high-quality question and high-quality answer pairs can be highlighted to attract more attention, high-quality question and low-quality answer pairs can be recommended to experts, and low-quality question and low-quality answer pairs can be considered for community moderation. In this work, we fix the time window of the early stage to 24 hours. In the future work, we plan to change the time window to 3 hours, 1 hour, or even the time before any answers have been posted.

Another direction of future work is to investigate the joint evolution of question quality and answer quality. It would be interesting to study their interplay over time, and to employ these dynamics to further improve the accuracy of quality prediction.

8 Conclusions

In this paper, we study the relationship between the quality of questions and answers in the CQA setting. We start with an empirical study of the Stack Overflow dataset where we observe a strong quality correlation between questions and their associated answers. Armed with this observation, we next propose a family of algorithms to jointly predict the quality of questions and answers, for both quantifying numerical quality scores and differentiating the high-quality questions/answers from those of low quality. Extensive experimental evaluations show that our methods outperform the state-of-the-art methods in effectiveness, and scale linearly wrt the number of questions and answers.


Appendix

Here, we present the detailed algorithms for Alg. ?.

The core of the algorithm is to iteratively update the coefficients (steps 6-7):

w_q ← w_q − η · ∂J/∂w_q,    w_a ← w_a − η · ∂J/∂w_a

where η is the learning step size and J denotes the objective in Eq. (3). Notice that we have four variants based on the combinations of loss functions, and the partial derivatives for each variant are obtained by the chain rule with the corresponding ℓ₁ and ℓ₂ plugged in.
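As one concrete instance, a NumPy sketch of the partial derivatives for the all-square-loss variant (CoPs-QQ), under the objective form assumed in Section 4.4; `Xq` and `Xa` again denote the transferred feature matrices:

```python
import numpy as np

def cops_qq_gradients(Xq, Xa, yq, ya, Mn, wq, wa, alpha=1.0, lam=0.01):
    """Partial derivatives of
    J = ||Xq wq - yq||^2 + ||Xa wa - ya||^2
        + alpha * ||Xq wq - Mn Xa wa||^2 + lam * (||wq||^2 + ||wa||^2)
    with respect to wq and wa (assumed objective form)."""
    pq, pa = Xq @ wq, Xa @ wa    # predicted question / answer scores
    diff = pq - Mn @ pa          # question score minus average answer score
    dwq = 2 * (Xq.T @ (pq - yq) + alpha * (Xq.T @ diff) + lam * wq)
    dwa = 2 * (Xa.T @ (pa - ya) - alpha * ((Mn @ Xa).T @ diff) + lam * wa)
    return dwq, dwa

# Usage with the skeleton of Alg. ?:
#   grad_fn = lambda wq, wa: cops_qq_gradients(Xq, Xa, yq, ya, Mn, wq, wa)
#   wq, wa = cops_fit(Xq, Xa, grad_fn)
```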

Footnotes

  1. http://blog.stackoverflow.com/2012/08/stack-exchange-machine-learning-contest/
  2. http://www.alexa.com/
  3. http://blog.stackoverflow.com/category/cc-wiki-dump/

References

  1. Knowledge sharing and yahoo answers: everyone knows something.
    L. Adamic, J. Zhang, E. Bakshy, and M. Ackerman. In WWW, pages 665–674. ACM, 2008.
  2. Finding high-quality content in social media.
    E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne. In WSDM, pages 183–194. ACM, 2008.
  3. Discovering value from community activity on focused question answering sites: a case study of stack overflow.
    A. Anderson, D. Huttenlocher, J. Kleinberg, and J. Leskovec. In KDD, pages 850–858. ACM, 2012.
  4. What are developers talking about? an analysis of topics and trends in stack overflow.
    A. Barua, S. Thomas, and A. Hassan. Empirical Software Engineering, pages 1–36, 2012.
  5. Learning to recognize reliable users and content in social media with coupled mutual reinforcement.
    J. Bian, Y. Liu, D. Zhou, E. Agichtein, and H. Zha. In WWW, pages 51–60. ACM, 2009.
  6. Finding relevant answers in software forums.
    S. Gottipati, D. Lo, and J. Jiang. In ASE, pages 323–332, 2011.
  7. Predictors of answer quality in online q&a sites.
    F. Harper, D. Raban, S. Rafaeli, and J. Konstan. In CHI, pages 865–874. ACM, 2008.
  8. The anatomy of a large-scale social search engine.
    D. Horowitz and S. Kamvar. In WWW, pages 431–440. ACM, 2010.
  9. A framework to predict the quality of answers with non-textual features.
    J. Jeon, W. Croft, J. Lee, and S. Park. In SIGIR, pages 228–235. ACM, 2006.
  10. Evolution of two-sided markets.
    R. Kumar, Y. Lifshits, and A. Tomkins. In WSDM, pages 311–320. ACM, 2010.
  11. G-finder: routing programming questions closer to the experts.
    W. Li, C. Zhang, and S. Hu. In OOPSLA, pages 62–73, 2010.
  12. Predicting web searcher satisfaction with existing community-based answers.
    Q. Liu, E. Agichtein, G. Dror, E. Gabrilovich, Y. Maarek, D. Pelleg, and I. Szpektor. In SIGIR, pages 415–424, 2011.
  13. Predicting information seeker satisfaction in community question answering.
    Y. Liu, J. Bian, and E. Agichtein. In SIGIR, pages 483–490. ACM, 2008.
  14. Design lessons from the fastest q&a site in the west.
    L. Mamykina, B. Manoim, M. Mittal, G. Hripcsak, and B. Hartmann. In CHI, pages 2857–2866. ACM, 2011.
  15. What makes a good code example? a study of programming q&a in stackoverflow.
    S. Nasehi, J. Sillito, F. Maurer, and C. Burns. 2012.
  16. Getting the most out of the web.
    T. Osbourn. Software, IEEE, 28(1):96–96, 2011.
  17. Crowd documentation: Exploring the coverage and the dynamics of api discussions on stack overflow.
    C. Parnin, C. Treude, L. Grammel, and M. Storey. Georgia Institute of Technology, Tech. Rep.
  18. Evaluating and predicting answer quality in community qa.
    C. Shah and J. Pomerantz. In SIGIR, pages 411–418. Citeseer, 2010.
  19. Question utility: A novel static ranking of question search.
    Y. Song, C. Lin, Y. Cao, and H. Rim. In AAAI, pages 1231–1236, 2008.
  20. Learning to recommend questions based on user ratings.
    K. Sun, Y. Cao, X. Song, Y. Song, X. Wang, and C. Lin. In CIKM, pages 751–758. ACM, 2009.
  21. Quality-aware collaborative question answering: methods and evaluation.
    M. Suryanto, E. Lim, A. Sun, and R. Chiang. In WSDM, pages 142–151. ACM, 2009.
  22. Predicting the perceived quality of online mathematics contributions from users’ reputations.
    Y. Tausczik and J. Pennebaker. In CHI, pages 1885–1888. ACM, 2011.
  23. How do programmers ask and answer questions on the web?: Nier track.
    C. Treude, O. Barzilay, and M. Storey. In ICSE, pages 804–807. IEEE, 2011.
  24. Learning to suggest questions in online forums.
    T. Zhou, C. Lin, I. King, M. Lyu, Y. Song, and Y. Cao. In AAAI, 2011.
  25. Routing questions to the right users in online communities.
    Y. Zhou, G. Cong, B. Cui, C. Jensen, and J. Yao. In ICDE, pages 700–711. IEEE, 2009.