DiffQue: Estimating Relative Difficulty of Questions in Community Question Answering Services
Automatic estimation of the relative difficulty of a pair of questions is an important and challenging problem in community question answering (CQA) services. Few studies have addressed this problem. Past studies mostly leveraged the expertise of users answering the questions and barely considered other properties of CQA services such as metadata of users and posts, temporal information and textual content. In this paper, we propose DiffQue, a novel system that maps this problem to a network-aided edge directionality prediction problem. DiffQue starts by constructing a novel network structure that captures different notions of difficulty among a pair of questions. It then measures the relative difficulty of two questions by predicting the direction of a (virtual) edge connecting these two questions in the network. It leverages features extracted from the network structure, metadata of users/posts and textual description of questions and answers. Experiments on datasets obtained from two CQA sites (further divided into four datasets) with human annotated ground-truth show that DiffQue outperforms four state-of-the-art methods by a significant margin (28.77% higher F1 score and 28.72% higher AUC than the best baseline). As opposed to the other baselines, (i) DiffQue appropriately responds to the training noise, (ii) DiffQue is capable of adapting to multiple domains (CQA datasets), and (iii) DiffQue can efficiently handle the ‘cold start’ problem which may arise due to the lack of information for newly posted questions or newly arrived users.
Programmers these days often rely on various community-powered platforms such as Stack Overflow, MathOverflow etc. – also known as Community Question Answering (CQA) services – to resolve their queries. A user posts a question/query which later receives multiple responses. The user can then choose the best answer (and mark it as ‘accepted answer’) out of all the responses. Such platforms have recently gained huge attention due to various features such as quick response from the contributors, quick access to the experts of different topics, succinct explanation, etc. For instance, in August 2010, Stack Overflow accommodated users and questions; these numbers have currently jumped to users and questions posted (https://stackexchange.com/sites/). This in turn provides tremendous opportunity to the researchers to consider such CQA services as large knowledge bases to solve various interesting problems (Vasilescu et al., 2014; Liu et al., 2013; Yang et al., 2008).
Plenty of research has been conducted on CQA services, including question search (Xue et al., 2008), software development assistance (de Souza et al., 2014), question recommendation (San Pedro and Karatzoglou, 2014), etc. A significant amount of work has been done on estimating user expertise (Liu et al., 2011), recommending tags (Wang et al., 2014b), developing automated techniques to assist editing a post (Chen et al., 2017), etc. Studies have also been conducted to help developers map their queries to the required code snippets (Campbell and Treude, 2017).
Problem Definition and Motivation: In this paper, we attempt to automatically estimate the relative difficulty of a question among a given pair of questions posted on CQA services. Such a system would help experts in retrieving questions with the desired difficulty, hence making the best use of their time and knowledge. It can also assist in preparing a knowledge base of questions with varying levels of difficulty. Academicians can use this system to set up their question papers on a particular topic. The proposed solution can be used in a variety of scenarios; examples of such cases include: question routing, incentive mechanism, linguistic analysis, analysing user behaviour, etc. In question routing, questions are recommended to the specified user as per his/her expertise. In incentive mechanism, points/reputation are allocated to the answerers depending upon the difficulty of the question. Linguistic analysis allows one to relate the language of a question to its difficulty, and to assess the significance of how it is framed and presented to the audience. Analysing user behaviour identifies the preference of users in answering the questions, and the strategies used by them to increase their reputation in the system. We present an example of two such use cases in Figure 1. However, computing the difficulty of a question is a challenging task as we cannot rely only on textual content or reputation of users. High reputation of a user does not always imply that his/her posted questions will be difficult. Similarly, a question with embedded code (maybe with some obfuscation) does not always indicate the difficulty level. Therefore, we need to find a novel solution which uses different features of a question and the interaction among various users to learn characteristics of the question and estimate the difficulty level. To our knowledge, there are only three works (Liu et al., 2013; Yang et al., 2008; Wang et al., 2014a) which have tackled this problem before.
The major limitations of these methods are as follows: (i) they do not consider the varying difficulty of questions raised by same person; (ii) they often ignore low-level metadata; and (iii) none of them considered the temporal effect.
Proposed Framework: In this paper, we propose DiffQue, a relative difficulty estimation system for a pair of questions that leverages a novel (directed and temporal) network structure generated from the user interactions through answering the posted questions on CQA services. DiffQue follows a two-stage framework (Figure 2). In the first stage, it constructs a network whose nodes correspond to the questions posted, and edges are formed based on a set of novel hypotheses (which are statistically validated) derived from the temporal information and user interaction available in CQA services. These hypotheses capture different notions of ‘difficulty among a pair of questions’. In the second stage, DiffQue maps the ‘relative question difficulty estimation’ problem to an ‘edge directionality prediction’ problem. It extracts features from the user metadata, network structure and textual content, and runs a supervised classifier to estimate the relatively more difficult question in a question pair. In short, DiffQue systematically captures three fundamental properties of a CQA service – user metadata, textual content and temporal information.
Summary of the Results: We evaluate the performance of DiffQue on two CQA platforms – Stack Overflow (which is further divided into three parts based on time) and Mathematics Stack Exchange. We compare DiffQue with three other state-of-the-art techniques – RCM (Wang et al., 2014a), Trueskill (Liu et al., 2013) and PageRank (Yang et al., 2008) along with another baseline (HITS) we propose here. All these baselines leverage some kind of network structure. Experimental results on human annotated data show that DiffQue outperforms all the baselines by a significant margin – on average across all the datasets, DiffQue achieves an F1 score (resp. AUC) that is 28.77% (resp. 28.72%) higher than the best baseline. We statistically validate our network construction model and show that if the other baselines had leveraged our network structure instead of their own, they could have improved their accuracy significantly (on average 2.67%, 9.46% and 17.84% improvement for RCM, Trueskill and PageRank respectively; HITS already leverages our network) compared to their proposed system configuration. Further analysis of the performance of DiffQue reveals that (i) DiffQue appropriately responds to the training noise, (ii) DiffQue is capable of adapting to multiple domains, i.e., if it is trained on one CQA platform and tested on another CQA platform, the performance does not change much as compared to training and testing on the same domain, and (iii) DiffQue can handle the cold start problem which may arise due to insufficient information about posts and users.
Contribution of the Paper: In short, the major contributions of the paper are four-fold:
We propose a novel network construction technique by leveraging the user interactions and temporal information available in CQA services. The network provides a relative ordering of pairs of questions based on the level of difficulty. We also show that the baselines could improve their performance if they use our network instead of theirs.
We map the problem of ‘relative difficulty estimation of questions’ to an ‘edge directionality prediction’ problem, which, to our knowledge, is the first attempt of this kind to solve this problem. Our proposed method utilizes three fundamental properties of CQA services – user information, temporal information and textual content.
DiffQue turns out to be superior to the state-of-the-art methods – it not only beats the other baselines in terms of accuracy, but also appropriately responds to the training noise and handles cold start problem.
As a by-product of the study, we generated huge CQA datasets and manually annotated a set of question pairs based on the difficulty level. This may become a valuable resource for the research community.
For the sake of reproducible research, we have made the code and dataset public at https://github.com/LCS2-IIITD/DiffQue-TIST.
2. Related Work
Recently, we have seen the expansion of various CQA services, including Stack Overflow, Quora, Reddit etc. There are a number of studies that involve these CQA services. Here we organize the literature review in two parts: general studies involving CQA and studies on difficulty ranking of questions.
2.1. General Study Involving CQA and Ranking mechanism
CQA has grown tremendously in size, thus providing huge databases to the research community. With such active contribution from the community, the data can be used for many important problems. Nasehi et al. (2012) mined the Stack Overflow data and provided ways to identify helpful answers. They identified the characteristics of a good code example and the different attributes that should accompany the code to make it more understandable. Chowdhury and Chakraborty (2019) released CQASUMM, the first annotated CQA summarization dataset, and developed VIZ-Wiki, a browser extension to summarize the threads in CQA sites (Chowdhury et al., 2018).
Zhou et al. (2009) recommended questions available in CQA to experts based on their domain expertise. They find top potential experts for a question in the CQA service. Guo et al. (2009) observed user’s lifetime in CQA services and found that it does not follow exponential distributions. There are many studies that analyzed temporal effect in CQA sites (Hong and Shen, 2009; Yang and Leskovec, 2011; Cardon et al., 2011). Cardon et al. (2011) showed that there are two strategies to develop online authority: progressively developing reputation or by using prior acquired fame. Hong and Shen (2009) found online user activities time-dependent, and proposed question routing mechanism for timely answer. Yang and Leskovec (2011) developed K-SC algorithm on Twitter dataset and showed that online content exhibits six temporal shapes. Hu et al. (2015) found three parameters to capture temporal significance and to personalize PageRank – frequency factor (trustworthy users are often active), built-up time length factor (the longer the time between registration of user and link creation, the more trustworthy the user) and similarity factor (the pattern based on which two users add links). Many efforts were made in the field of expert identification, which mostly focus on link analysis and latent topic modeling techniques (Yang et al., 2013). Bouguessa et al. (2008) focused on best answers provided by users to quantify experts. It is based on the concept of in-degree. Weng et al. (2010) combined topical similarity with link analysis for expert identification. Some techniques leveraged Gaussian mixture models to solve the problem (Pal et al., 2012; Pal and Counts, 2011). Jeon et al. (2005) attempted to find similar questions and recommend them by exploring semantic features.
Appropriate ranking of entities (documents, search results, recommended items) has been one of the major concerns in the information retrieval and data mining communities. Pioneering attempts were made in the late 1990s, which resulted in several query-independent ranking schemes such as PageRank (Page et al., 1999), TrustRank (Gyöngyi et al., 2004), BrowseRank (Liu et al., 2008), etc. The limitations of these ranking mechanisms were further addressed in several query-dependent ranking models such as HITS (Kleinberg, 1999a), SALSA (Lempel and Moran, 2001), LSI (Deerwester et al., 1990), vector space model (Baeza-Yates et al., 2011), statistical language model (Zhai, 2008), etc. The problem with such conventional ranking models is that the model parameters are usually hard to tune manually, which may sometimes lead to overfitting. Moreover, it is non-trivial to combine the large number of conventional models proposed in the literature to obtain an even more effective model. To overcome these problems, “learning-to-rank” was introduced, which uses machine learning technologies to solve the problem of ranking (Liu et al., 2009). Methods dealing with learning-to-rank can be categorized into three classes (Cao et al., 2007) – pair-wise (Chu and Ghahramani, 2005a, b), point-wise (Qin et al., 2007; Tsai et al., 2007), and list-wise (Burges et al., 2007; Cao et al., 2007). A number of machine learning algorithms have been proposed which optimize relaxations to minimize the number of pairwise misorderings in the ranking produced. Examples include RankNet (Burges et al., 2005), RankRLS (Pahikkala et al., 2009), RankSVM (Lee and Lin, 2014; Joachims, 2002; Herbrich et al., 1999), etc. Freund et al. (2003) proposed RankBoost, an adaptation of the boosting mechanism for learning to rank. Rijke (2019) used reinforcement learning for ranking and recommendation.
Learning-to-rank is also used in many applications – specification mining (Cao et al., 2018), named-entity recognition (Nguyen et al., 2018), crowd counting (Liu et al., 2018), etc. The machine learning community has also witnessed an adaptation of deep learning techniques for learning-to-rank (Cheng et al., 2018; He et al., 2018; Sasaki et al., 2018). Nevertheless, we here consider RankSVM and RankBoost as two baselines for DiffQue.
2.2. Difficulty Ranking of Questions
Hassan et al. ([n.d.]) developed SOQDE, a supervised learning based model, and performed an empirical study to determine how the difficulty of a question impacts its outcome. Arumae et al. (2018) used a variable-length context convolutional neural network model to investigate whether it is possible to predict if a question will be answered. Chang and Pal (2013) developed a routing mechanism that uses availability and expertise of the users to recommend answerers to a question. Yang et al. (2008) and Liu et al. (2013) attempted to solve the problem of finding the difficulty of tasks taking the underlying network structure into account. Ranking of questions according to difficulty is a complex modeling task and highly subjective. Yang et al. (2008) constructed a network in such a way that if a user answers questions $q_1$ and $q_2$, among which only the answer to $q_1$ is accepted, then there will be a directed edge from $q_1$ to $q_2$. Following this, they used PageRank as a proxy of the difficulty of a question. Liu et al. (2013) created a user network where each question $q$ is treated as a pseudo user $U_q$. It considers four competitions: one between pseudo user $U_q$ and asker $U_a$, one between pseudo user $U_q$ and the best answerer $U_b$, one between the best answerer $U_b$ and asker $U_a$, and one between the best answerer $U_b$ and each of the non-best answerers of the question $q$. It then uses Trueskill, a Bayesian skill rating model. Both Yang et al. (2008) and Liu et al. (2013) used neither textual content nor temporal effect, which play a crucial role in determining the importance of questions over time. Wang et al. (2014a) used the same graph structure as Liu et al. (2013) and proposed the Regularized Competition Model (RCM) to capture the significance of difficulty. It forms a vector $\theta$, denoting the ‘expert score’ of pseudo users – the initial entries are the expertise of users while the remaining ones are the difficulty of questions. For each competition $k$ between $i$ and $j$, a vector $x_k$ is formed where $x_k^{(i)} = 1$, $x_k^{(j)} = -1$, and $y_k = 1$ if $i$ wins over $j$, else $y_k = -1$.
The algorithm starts at an initial $\theta_0$ and proceeds along the negative subgradient, $\theta_{t+1} = \theta_t - \gamma_t \nabla L(\theta_t)$, where $\nabla L(\theta_t)$ is the subgradient and $\gamma_t$ is set as 0.001. Further, classification of examination questions according to Bloom’s Taxonomy (Fattah et al., 2007) tried to form an intuition of the difficulty of questions; but it is not very helpful in the case of CQA services where there are many users, and individual users have their own ways of expressing the questions, or sometimes there is code attached to the question where normal string matching can trivialize the problem. We consider the three methods (Liu et al., 2013; Yang et al., 2008; Wang et al., 2014a) mentioned above as baselines for DiffQue.
How DiffQue differs from others: Yang et al. (2008) ran PageRank on a network based on a single hypothesis that if a user wins in task X and not in Y, then there is a directed edge from X to Y. They used PageRank to compute the hardness of questions. In their method, PageRank can also be replaced by authoritativeness (Kleinberg, 1999b) to rank the questions (which we also treat as a baseline). Neither of these takes into account some important points such as the varying difficulty of questions asked by the same person, or the different persons who answered their questions. Liu et al. (2013) and Wang et al. (2014a) used the same network, and applied Bayesian network and regularisation respectively. Liu et al. (2013) did not consider the cold start problem and textual content, while Wang et al. (2014a) did not consider low-level features such as accepted answers, time difference between posting and acceptance, etc. The temporal effect, which plays a significant role in this type of setting, is also not discussed in the previous works. DiffQue addresses all the limitations/disadvantages of the previous works mentioned above.
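The single-hypothesis PageRank baseline described above can be illustrated with a minimal sketch. It is not the implementation of Yang et al. (2008); the record layout `(user, question_id, accepted)`, the function names, the damping factor of 0.85 and the uniform handling of dangling nodes are all assumptions made for illustration.

```python
from collections import defaultdict


def build_competition_edges(answers):
    """Draw an edge X -> Y when a user's answer to X was accepted but the
    same user's answer to Y was not (the single hypothesis described above).

    answers: iterable of hypothetical (user, question_id, accepted) records.
    """
    won, lost = defaultdict(set), defaultdict(set)
    for user, qid, accepted in answers:
        (won if accepted else lost)[user].add(qid)
    return {(x, y)
            for user in won
            for x in won[user]
            for y in lost.get(user, ())
            if x != y}


def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank; a higher score is read as 'harder'."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-edges is spread uniformly (assumption).
        dangling = sum(rank[v] for v in nodes if not out[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out[v]:
                new_rank[w] += damping * rank[v] / len(out[v])
        rank = new_rank
    return rank
```

Because every edge points from the question a user "won" to the question the same user "lost", rank accumulates on the questions that more users failed to answer acceptably, which is why the score can serve as a difficulty proxy.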
We collected questions and answers from two different CQA services – (i) Stack Overflow (SO; https://stackoverflow.com/) and (ii) Mathematics Stack Exchange (MSE; https://math.stackexchange.com/), both of which are extensively used by programmers or mathematicians to get their queries resolved. A brief description of the datasets is presented below (see Table 1). We chose SO because it is the most preferred platform for programmers to resolve queries and has a very active community. Such a high level of engagement among users would help in modeling the underlying network better. Similarly, MSE is widely used by mathematicians and has an active community. Most importantly, both the datasets are periodically archived for research purposes.
3.1. Stack Overflow (SO) Dataset
This dataset contains all the questions and the answers of Stack Overflow till Aug’17, accommodating more than questions (70% of which are answered) and more than answers in total. In this work, we only considered ‘Java-related’ questions as suggested in (Liu et al., 2013). Previous work (Liu et al., 2013) considered all questions and answers submitted to SO during Aug’08 – Dec’10 (referred to as SO1). Along with this dataset, we also collected two more recent datasets from SO, covering two different time durations – questions and answers posted in Jan’12 – Dec’13 (referred to as SO2) and in Aug’15 – Aug’17 (referred to as SO3). Analyzing datasets from different time periods can also help in capturing the dynamics of user interaction, which may change over time. User metadata was also available along with the questions and answers.
3.2. Mathematics Stack Exchange (MSE) Dataset
This dataset contains all the questions posted on MSE till Aug’17 and their answers. There are about questions and about a million answers. About 75% of the questions are answered. Here we only extracted questions related to the following topics: Probability, Permutation, Inclusion-Exclusion and Combination (the topics were chosen based on the expertise of the human annotators who further annotated the datasets). The questions and answers belonging to these topics are filtered from the entire database to prepare our dataset. The number of users in the chosen sample is 47,470 (Table 1). Unlike SO, we did not divide MSE into different parts because the total number of questions present in MSE is small compared to SO (less than even a single part of SO). Further division might have reduced the size of the dataset significantly.
Both these datasets are available as dumps in XML format. Each post (question/answer) comes with other metadata: upvotes/downvotes, posting time, whether the answer is accepted or not, tag(s) specifying the topics of the question, etc. Similarly, the metadata of users include account id, registration time and reputation. Unlike other baselines (Liu et al., 2013; Yang et al., 2008; Wang et al., 2014a), we also utilize these attributes in DiffQue.
| Dataset | # questions | # answers | # users | Time period |
| --- | --- | --- | --- | --- |
| SO1 | 100,000 | 289,702 | 60,443 | Aug’08 – Dec’10 |
| SO2 | 342,450 | 603,402 | 179,827 | Jan’12 – Dec’13 |
| SO3 | 440,464 | 535,416 | 274,421 | Aug’15 – Aug’17 |
| MSE | 92,686 | 119,754 | 47,470 | July’10 – Aug’17 |
4. DiffQue: Our Proposed Framework
DiffQue first maps given CQA data to a directed and longitudinal network (nodes are arranged based on their creation time) where each node corresponds to a question, and an edge pointing from one question to another question indicates that the latter question is harder than the former one. Once the network is constructed, DiffQue trains an edge directionality prediction model on the given network and predicts the directionality of a virtual edge connecting two given questions under inspection. The framework of DiffQue is shown in Figure 2. The rest of this section elaborates the individual components of DiffQue.
Nomenclature: Throughout the paper, we will assume that Bob has correctly answered Robin’s question on a certain topic, and therefore Bob has more expertise than Robin on that topic. Assumption: Whenever we mention that Bob has more expertise than Robin, it is always w.r.t. a certain topic, which may not apply for other topics.
4.1. Network Construction
DiffQue models the entire dataset as a directed and longitudinal network $G(V, E)$, where $V$ indicates a set of vertices and each vertex corresponds to a question; $E$ is a set of edges. Each edge can be of one of the three types mentioned below.
Edge Type 1: An expert on a certain topic does not post trivial questions on CQA sites. Moreover, s/he answers those questions which s/he has expertise on. We capture these two notions in Hypothesis 1.
Hypothesis 1.
If Bob correctly answers question $Q_R$ asked by Robin on a certain topic, then the questions asked by Bob later on that topic will be considered more difficult than $Q_R$.
The ‘correctness’ of an answer is determined by the ‘acceptance’ status of the answer or positive number of upvotes provided by the users (available in CQA services).
Let $Q_R$ be the question related to topic $\tau$ that Robin posted at time $t$ and Bob answered. Bob later asked questions related to $\tau$, namely $Q_{B_1}, Q_{B_2}, \ldots$ at $t_1, t_2, \ldots$ respectively. Then the difficulty level of each $Q_{B_i}$, denoted by $D(Q_{B_i})$, will be more than that of $Q_R$, i.e., $D(Q_{B_i}) > D(Q_R)$. The intuition behind this hypothesis is as follows – since Bob correctly answered $Q_R$, Bob is assumed to have more expertise than the expertise required to answer $Q_R$. Therefore, the questions that Bob will ask later from topic $\tau$ may need more expertise than that of $Q_R$.
We use this hypothesis to draw edges from easy questions to difficult questions as follows: an edge $Q_i \rightarrow Q_j$ of type 1 indicates that $Q_j$ is more difficult than $Q_i$ according to Hypothesis 1. Moreover, each such edge will always be a forward edge, i.e., a question asked earlier will point to the question asked later. Edge in Figure 3 is of type 1.
One may argue that the answering time is also important in determining type 1 edges – if at $t'$ Bob answers $Q_R$ which has been asked by Robin at $t$ (where $t' \geq t$), we should consider all Bob’s questions posted after $t'$ (rather than $t$) to be more difficult than $Q_R$. However, in this paper we do not consider answering time separately and assume question and answering times to be the same because the time difference between posting the question and correctly answering the question, i.e., $t' - t$, seems to be negligible across the datasets (on average , , , days for SO1, SO2, SO3 and MSE respectively).
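A minimal sketch of how type 1 edges could be drawn under Hypothesis 1 follows. It is only an illustration, not the authors' code: the layout `questions[qid] = (asker, topic, time)`, the list of `(answerer, qid)` pairs for correct answers, the strict "later" comparison and the exclusion of self-answers are all assumptions.

```python
from collections import defaultdict


def type1_edges(questions, correct_answers):
    """Return forward edges (easy_qid, hard_qid) under Hypothesis 1.

    questions: dict mapping a hypothetical question id to (asker, topic, time).
    correct_answers: iterable of (answerer, qid) pairs, one per correct answer.
    """
    # Index every question by (asker, topic) so Bob's questions are quick to find.
    by_asker_topic = defaultdict(list)
    for qid, (asker, topic, time) in questions.items():
        by_asker_topic[(asker, topic)].append((time, qid))

    edges = set()
    for answerer, qid in correct_answers:
        asker, topic, time = questions[qid]
        if answerer == asker:
            continue  # assumption: self-answers carry no expertise signal
        for later_time, later_qid in by_asker_topic.get((answerer, topic), []):
            if later_time > time:  # only questions the answerer asked later
                edges.add((qid, later_qid))
    return edges
```

Following the simplification stated above, the sketch uses the posting time of the answered question rather than the answering time when deciding which of the answerer's questions count as "later".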
Edge Type 2: It is worth noting that an edge of type 1 only assumes those questions of Bob to be difficult which will be posted later. It does not take into account the fact that all of Bob’s contemporary questions (posted very recently in the past on the same topic) may be more difficult than Robin’s current question, even if the former questions were posted slightly before the latter question. We capture this notion in Hypothesis 2 and draw type 2 edges.
Hypothesis 2.
If Bob correctly answers Robin’s question $Q_R$ related to topic $\tau$, then Bob’s very recently posted questions on $\tau$ will be more difficult than $Q_R$.
Let $Q_{B_1}, Q_{B_2}, \ldots$ be the questions posted by Bob at $t_1, t_2, \ldots$ respectively. Bob has answered Robin’s question $Q_R$ posted at $t$ and $t \geq t_i$. However, the difference between $t$ and $t_i$ is significantly small, i.e., $t - t_i \leq \delta$ for a small recency threshold $\delta$, which indicates that all these questions are contemporary. According to Hypothesis 2, $D(Q_{B_i}) > D(Q_R)$.
We use this hypothesis to draw edges from easy questions to difficult questions as follows: an edge $Q_i \rightarrow Q_j$ of type 2 indicates that $Q_j$ is more difficult than $Q_i$, and $Q_j$ was posted within $\delta$ time before the posting of $Q_i$. Note that each such edge will be a backward edge, i.e., a question asked later may point to the question asked earlier. Edge in Figure 3 is of type 2.
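The backward edges of type 2 can be sketched in the same spirit. Again this is an assumption-laden illustration rather than the authors' implementation: the layout `questions[qid] = (asker, topic, time)`, the name `delta` for the recency threshold, and the choice to include questions posted at exactly the same time are all hypothetical.

```python
def type2_edges(questions, correct_answers, delta):
    """Return backward edges (easy_qid, hard_qid) under Hypothesis 2.

    questions: dict mapping a hypothetical question id to (asker, topic, time).
    correct_answers: iterable of (answerer, qid) pairs, one per correct answer.
    delta: recency threshold, in the same unit as the question times.
    """
    edges = set()
    for answerer, qid in correct_answers:
        asker, topic, time = questions[qid]
        if answerer == asker:
            continue  # assumption: ignore users answering their own question
        for other_qid, (other_asker, other_topic, other_time) in questions.items():
            same_author_and_topic = other_asker == answerer and other_topic == topic
            # Contemporary: posted no later than the answered question and
            # at most `delta` time units before it.
            if same_author_and_topic and 0 <= time - other_time <= delta:
                edges.add((qid, other_qid))
    return edges
```

The edge still points from the easier question (Robin's) to the harder one (Bob's), so its direction in time is backward even though its direction in difficulty matches that of type 1 edges.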
Edge Type 3: We further consider questions which are posted by a single user on a certain topic over time, and propose Hypothesis 3.
Hypothesis 3.
A user’s expertise on a topic will increase over time, and thus the questions that s/he will ask in future related to that topic will keep becoming difficult.
Let $Q_{B_1}, Q_{B_2}, \ldots$ be the questions posted by Bob (the same hypothesis can be applied to any user, Bob or Robin) at $t_1, t_2, \ldots$, where $t_{i+1} > t_i$. Then $D(Q_{B_{i+1}}) > D(Q_{B_i})$. The underlying idea is that as time progresses, a user gradually becomes more efficient and acquires more expertise on a particular topic. Therefore, it is more likely that s/he will post questions from that topic which will be more difficult than his/her previous questions.
We use this hypothesis to draw an edge from easy questions to difficult questions as follows: an edge $Q_i \rightarrow Q_j$ of type 3 indicates that: (i) both $Q_i$ and $Q_j$ were posted by the same user, (ii) $Q_j$ was posted after $Q_i$ and there was no question posted by the user in between the posting of $Q_i$ and $Q_j$ (i.e., $Q_j$ immediately follows $Q_i$ among that user's questions), and (iii) $Q_j$ is more difficult than $Q_i$. Note that each such edge will also be a forward edge. Edge in Figure 3 is of type 3, but and are not connected according to this hypothesis.
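Type 3 edges connect only consecutive questions of the same user on the same topic. The sketch below illustrates that rule under assumed data structures (`questions[qid] = (asker, topic, time)`); it is not taken from the released code, and skipping ties in posting time is an assumption.

```python
from collections import defaultdict


def type3_edges(questions):
    """Return forward edges (easy_qid, hard_qid) under Hypothesis 3.

    questions: dict mapping a hypothetical question id to (asker, topic, time).
    """
    by_user_topic = defaultdict(list)
    for qid, (asker, topic, time) in questions.items():
        by_user_topic[(asker, topic)].append((time, qid))

    edges = set()
    for posts in by_user_topic.values():
        posts.sort()  # chronological order within one user and topic
        # Pair each question only with the one that immediately follows it.
        for (prev_time, prev_qid), (next_time, next_qid) in zip(posts, posts[1:]):
            if next_time > prev_time:  # assumption: ties produce no edge
                edges.add((prev_qid, next_qid))
    return edges
```

Linking only neighbours keeps the number of type 3 edges linear in the number of questions per user, while longer-range orderings remain implied by paths through the network.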
Note that an edge can be formed by more than one of these hypotheses. However, we only keep one instance of such edge in the final network (and avoid multi-edge network). In Section 4.3, we will show that all these hypotheses are statistically significant. The number of edges of each type is shown in Table 2. Note that the hypotheses proposed in (Liu et al., 2013) are different from those we proposed here. The earlier work did not consider the temporal dimension at all. They compared question asker, question, best answerer, and all others who answered the question. Further their hypotheses related question, answerer and other answerer only within a thread and not across different threads. On the other hand, hypothesis 1 proposed in this paper establishes a relationship between the questions that are answered and all questions asked by the same answerer in future. Our proposed hypothesis 2 establishes a relationship between recently asked questions by the answerer and the question answered by the answerer.
4.2. Parameters for Network Construction
There are two parameters to construct the network:
Time of posting:
Each post (question/answer) is associated with an (absolute) posting time which is used to decide the edge type. Here, instead of considering the ‘absolute time’ as the posting time, we divide the entire timespan present in a dataset into 2-week intervals (called ‘buckets’). We assume that those posts which were posted within a bucket have the same posting time, and therefore, Hypotheses 1 and 2 may connect them in the network. The reasons behind choosing bucketed time instead of absolute time are two-fold: (i) we believe that questions posted by a user around the same time would be of the same difficulty as s/he would not have gained much expertise on that topic in such a short period of time; (ii) considering absolute time may increase the number of edges in the network, which may unnecessarily densify the network and make the entire system memory intensive. We chose a bucket interval of 2 weeks as it produces the best result over a range from 1 week to 6 weeks (Table 5). The bucket size matters for constructing the network as well as for measuring a few features (mentioned in Section 4.4).
Recency of questions for type 2 edges: For edge type 2, we have introduced to quantify the recency / contemporariness of Bob’s question (see the description of edge type 2). has been asked ‘very recently’ before Robin’s question which Bob has correctly answered. We vary from to and observe that produces the best result (Table 5).
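The bucketing of posting times described above amounts to integer division of the elapsed time by the bucket width. A minimal sketch, assuming `datetime` timestamps and a hypothetical dataset start time `start`:

```python
from datetime import datetime, timedelta


def bucket_index(posted_at, start, bucket_weeks=2):
    """Return the 0-based index of the bucket that contains `posted_at`.

    posted_at: datetime of the post.
    start: datetime marking the beginning of the dataset's timespan (assumed).
    bucket_weeks: bucket width in weeks (2 weeks is the value chosen above).
    """
    width = timedelta(weeks=bucket_weeks)
    # Floor division of two timedeltas yields an int: the number of whole buckets.
    return (posted_at - start) // width
```

Two posts then share a "posting time" exactly when their bucket indices are equal, so time comparisons in the edge rules can be carried out on bucket indices instead of raw timestamps.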
|Edge type 1||Edge type 2||Edge type 3|
Example 4.1.
Let us consider that Robin has asked three questions, , and at time , and respectively, and Bob has correctly answered . Bob has also asked four questions , , and at time , , and respectively. Figure 3 shows the corresponding network for this example considering .
4.3. Hypothesis Testing
We further conducted a thorough survey to show that the three hypotheses behind our network construction mechanism are statistically significant. For this, we prepared six sets of edge samples, two for testing each hypothesis, as follows:
Sample 1: Choose edges of type 1 randomly from the network.
Random Sample 1: Randomly select pairs of questions (may not form any edge) which obey the time constraint mentioned in Hypothesis 1.
Sample 2: Choose edges of type 2 randomly from the network.
Random Sample 2: Randomly select pairs of questions such that they follow the recency constraint mentioned in Hypothesis 2.
Sample 3: Choose edges of type 3 randomly from the network.
Random Sample 3: Randomly select pairs of questions such that they follow the time constraint mentioned in Hypothesis 3.
The survey was conducted with human annotators of age between 25-35, who are experts on Java and math-related domains. For a question pair ($Q_1$, $Q_2$) taken from each sample, we hypothesize that $D(Q_2) > D(Q_1)$, i.e., $Q_2$ is more difficult than $Q_1$. The null hypothesis rejects our hypothesis. Given a question pair, we asked each annotator to mark ‘Yes’ if our hypothesis holds; otherwise mark ‘No’. One example question pair for each sample and the corresponding human response statistics (percentage of annotators who accepted our hypothesis) are shown in Table 3. Table 4 shows that for each hypothesis, the average number of annotators who accepted the hypothesis is higher for the sample taken from our network as compared to that for its corresponding random sample (the difference is also statistically significant according to the Kolmogorov-Smirnov test). This result therefore justifies our edge construction mechanism and indicates that all our hypotheses are realistic.
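To make the comparison between a network sample and its random counterpart concrete, the two-sample Kolmogorov-Smirnov statistic can be computed directly from the two empirical distributions. The sketch below is illustrative only: the function name and the idea of feeding it per-pair acceptance fractions are assumptions, and it returns just the statistic, not a p-value (a statistics library such as SciPy's `ks_2samp` would be needed for that).

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The gap between two step functions can only change at observed values.
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

In this setting, `sample_a` could hold the fraction of annotators answering ‘Yes’ for each pair drawn from the network and `sample_b` the same fraction for each randomly drawn pair; a statistic close to 1 means the two sets of responses barely overlap.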
|Hypothesis 1||Hypothesis 2||Hypothesis 3|
|Sample 1||Sample 2||Sample 3|
|(Q1) Write a program to generate all elements of power set?||(Q1) Given an array of size n, give a deterministic algorithm (not quick sort) which uses O(1) space (not median of medians) and find the Kt́h smallest item.||(Q1) Given an array of integers (each ¡= ) what is the fastest way to find the sum of powers of prime factors of each integer?|
|(Q2) How can one iterate over the elements of a stack starting from the top and going down without using any additional memory. The default iterator() goes from bottom to top. For Deque, there is a descendingIterator. Is there anything similar to this for a stack. If this is not possible which other Java data structures offer the functionality of a stack with the ability to iterate it backwards?||(Q2) Transform a large file where each line is of the form: b d, where b and d are numbers. Change it from: b -1 to b 1, where b should remain unchanged. There is a huge file and a way is required to achieve this using, say, sed or a similar tool?||(Q2) Given an array of N positive elements, one has to perform M operations on this array. In each operation a subarray (contiguous) of length W is to be selected and increased by 1. Each element of the array can be increased at most K times. One has to perform these operations such that the minimum element in the array is maximized. Only one scan is required.|
|Response: (Q2)>(Q1)? Yes: 75%||Response: (Q2)>(Q1)? Yes: 75%||Response: (Q2)>(Q1)? Yes: 80%|
|Random sample 1||Random sample 2||Random sample 3|
|(Q1) If an object implements the Map interface in Java and one wishes to iterate over every pair contained within it, what is the most efficient way of going through the map?||(Q1) Given a string s of length n, find the longest string t that occurs both forward and backward in s. e.g., s = yabcxqcbaz, then return t = abc or t = cba. Can this be done in linear time using suffix tree?||(Q1) What happens at compile and runtime when concatenating an empty string in Java?|
|(Q2) How to format a number in Java? Should the number be formatted before rounding?||(Q2) Given two sets A and B, what is the algorithm used to find their union, and what is its running time?||(Q2) What is the simplest way to convert a Java string from all caps (words separated by underscores) to CamelCase (no word separators)?|
|Response: (Q2)>(Q1)? Yes: 20%||Response: (Q2)>(Q1)? Yes: 35%||Response: (Q2)>(Q1)? Yes: 30%|
|Hypothesis||Original sample||Random sample||p-value|
4.4. Edge Directionality Prediction Problem
Once the network is constructed, DiffQue treats the ‘relative question difficulty estimation’ problem as an ‘edge directionality prediction’ problem. Since an edge connecting two questions in the network points from the easier question to the more difficult one, given a pair of questions with unknown difficulty levels, the task boils down to predicting the direction of the virtual edge connecting these two questions in the network.
Mapping the given problem to a network-aided edge directionality prediction problem helps in capturing the overall dependency (relative difficulty) among questions. Edges derived from the novel hypotheses yield a directed network, which in turn lets us leverage the structural properties of the network. This also helps in generating a feature vector in which each feature captures a different notion of the difficulty of a question. The features pool together the network topology, metadata and textual content in order to predict which question of a pair is more difficult. The aim of this mapping is to find a model which captures the role of each of the chosen features in predicting the relative hardness of questions.
Although research on link prediction is vast in the network science domain (Lü and Zhou, 2011), the problem of edge directionality prediction is relatively less studied. Guo et al. (Guo et al., 2013) proposed a ranking-based method that combines both local indicators and global hierarchical structures of networks for predicting the direction of an edge. Soundarajan and Hopcroft (Soundarajan and Hopcroft, 2012) treated this problem as a supervised learning problem. They calculated various features of each known link based on its position in the network, and used SVM to predict the unknown directions of edges. Wang et al. (Wang et al., 2015) proposed a local directed path method that addresses the information loss problem in sparse networks and makes the method effective and robust.
We also consider the ‘edge directionality prediction problem’ as a supervised learning problem. The pairs of questions, each of which is directly connected via an edge in our network, form the training set. If the pair of questions under inspection is already connected directly by an edge, the problem is solved immediately by looking at the directionality of that edge. In our supervised model, given a pair of questions () (one entity in the population), we first connect them by an edge if they are not connected and then determine the directionality of the edge. Our supervised model uses the following features, which are broadly divided into three categories: (i) network topology based (F1-F6), (ii) metadata based (F7-F10), and (iii) textual content based (F11-F12):
[F1] Leader Follower ranking for node : Guo et al. (Guo et al., 2013) used a ranking scheme to predict the missing links in a network. The pseudo-code is presented in Algorithm 1. The algorithm takes an adjacency matrix adjMatrix(i)(j), which is if there is edge between to and otherwise. The indegree and outdegree of a node are represented by and respectively. At each step, for each node, , the difference between its in-degree and out-degree, is computed to separate leaders (high-ranked nodes) from followers (low-ranked nodes). Apart from adjacency matrix, the algorithm expects as input (set in our case as suggested in (Guo et al., 2013)), specifying the proportion of leaders at each step. The whole network is partitioned into leaders and followers, and this process continues recursively on the further groups obtained. If at any time, the number of nodes in any of the groups, is less than , then ranking is done according to . During the merging of leaders and followers, followers are placed after leaders. Finally, any node ranked lower (less important) is predicted to have an edge to higher ranked nodes (more important). In our case, the ranking is normalized by the number of questions and then used as a feature.
[F2] Leader Follower ranking for node : We use the similar strategy to measure the above rank for node .
[F3] PageRank of node : It emphasizes the importance of a node, i.e., a probability distribution signifying if a random walker will arrive to that node. We use PageRank to compute the score of each node. However, we modify the initialization of PageRank by incorporating the weight of nodes – let be the user asking question and the reputation (normalized to the maximum reputation of a user) of be . Then the PageRank score of is calculated as:
Here, is the neighbors of pointing to ; the damping factor is set to .
[F4] PageRank of node : We similarly calculate the PageRank score of node as mentioned in F3.
[F5] Degree of node : It is computed after considering the undirected version of the network, i.e., number of nodes adjacent to a node.
[F6] Degree of node : We calculate the degree of node similarly as mentioned in F5.
[F7] Posting time difference between and its accepted answer: It signifies the difference between , the posting time of and , the posting time of its accepted answer , if any. If none of the answers is accepted, it is set to 1. However, instead of taking the direct time difference, we employ an exponential decay as: . The higher the difference between the posting time and the accepted answer time (implying that users have taken more time to answer the question), the higher the difficulty level of the question.
[F8] Posting time difference between and its accepted answer: Similar score is computed for as mentioned in F7.
[F9] Accepted answers of users who posted till : The more the number of accepted answers of the user asking question till ’s posting time , the more the user is assumed to be an expert and the higher the difficulty level of . We normalize this score by the maximum value among the users.
[F10] Accepted answers of users who posted till : Similar score is measured for question as mentioned in F9.
[F11] Textual feature of : This feature takes into account the text present in . The idea is that if a question is easy, its corresponding answer should be present in a particular passage of a text book. Therefore, for , we first consider its accepted answer and check if the unigrams present in that answer are also available in different books. For Java-related questions present in SO1, SO2 and SO3, we consider the following books: (i) Core Java Volume I by Cay S. Horstmann (Horstmann, 2016a), (ii) Core Java Volume II by Cay S. Horstmann (Horstmann, 2016b), (iii) Java: The Complete Reference by Herbert Schildt (Schildt, 2014), (iv) OOP - Learn Object Oriented Thinking and Programming by Rudolf Pecinovsky (Pecinovsky, 2013), and (v) Object Oriented Programming using Java by Simon Kendal (Kendal, 2009). For Math-related questions present in MSE, we consider the following books: (i) Advanced Engineering Mathematics by Erwin Kreyszig (Kreyszig et al., 2011), (ii) Introduction to Probability and Statistics for Engineers and Scientists by Sheldon M. Ross (Ross, 2000), (iii) Discrete Mathematics and Its Applications by Kenneth Rosen (Rosen, 2003), (iv) Higher Engineering Mathematics by B.S. Grewal (Grewal and Grewal, 2001), and (v) Advanced Engineering Mathematics by K. A. Stroud (Stroud and Booth, 2003). For each type of question (Java/Math), we merge the relevant books into a single document. After several pre-processing steps (tokenization, stemming etc.), we divide the document into paragraphs, each of which forms a passage. We then compute a TF-IDF based similarity between each passage and the accepted answer. Finally, we take the maximum similarity among all the passages. The intuition is that if an answer is straightforward, most of its tokens should be concentrated in one passage, and the TF-IDF score would be higher than in the case where the answer tokens are dispersed across multiple passages.
If there is no accepted answer for a question, we consider this feature to be .
[F12] Textual feature of : We apply the same technique mentioned in F11 and measure the textual feature of .
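The passage-matching step behind F11 and F12 can be illustrated with a minimal sketch. It assumes that passages and the accepted answer are already tokenized and stemmed, and it uses cosine similarity over smoothed TF-IDF vectors; the exact similarity measure and IDF variant are not specified above, so both are assumptions of this sketch, as are the function names.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption): terms present in every document keep a positive weight.
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def textual_feature(answer_tokens, passages):
    """Maximum similarity between the accepted answer and any book passage."""
    vectors = tfidf_vectors(passages + [answer_tokens])
    answer_vec = vectors[-1]
    return max((cosine(answer_vec, p) for p in vectors[:-1]), default=0.0)

passages = [["hashmap", "iterate", "entry", "set"], ["string", "concat", "compile"]]
score = textual_feature(["iterate", "hashmap", "entry"], passages)
```

An answer whose tokens are concentrated in one passage scores close to 1, while an answer sharing no tokens with any passage scores 0. Questions without an accepted answer are handled by the caller, as described in F11.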
The proposed features are easy to compute once the network is constructed. A few features, such as posting time and textual content, do not require the network structure and can be derived directly from the metadata. These features are chosen after verifying their significance on the data. They capture different notions of difficulty on a temporal scale. Next, we describe how these features can be implemented with ‘usability’ and ‘scalability’ in mind. Features which require one-time computation, such as degree, the posting time difference between a question and its accepted answer, and the number of accepted answers of the asker at a particular time stamp, can be stored in a database (such as DynamoDB). We can compute PageRank and the leader-follower ranking in a distributed way using MapReduce as shown in (Malewicz et al., 2010). The specified components are extremely fast and cost-effective. Such engineering tricks further help us in developing the real system described in Section 6.
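To make the leader-follower ranking of F1 and F2 concrete, the following single-machine sketch follows the recursive description given in F1. Several details are assumptions of the sketch rather than the published algorithm: the values of `alpha` and `min_group` are placeholders, the degree difference is recomputed inside each group, and ties keep the order inherited from the parent group.

```python
import math

def leader_follower_order(nodes, edges, alpha=0.5, min_group=3):
    """Order nodes from highest rank (leaders) to lowest rank (followers)."""
    nodes = list(nodes)
    members = set(nodes)
    # delta = in-degree minus out-degree, counted inside the current group only.
    delta = {v: 0 for v in nodes}
    for src, dst in edges:
        if src in members and dst in members:
            delta[dst] += 1
            delta[src] -= 1
    # Stable sort: ties keep the order inherited from the parent group.
    ordered = sorted(nodes, key=lambda v: -delta[v])
    if len(ordered) < max(min_group, 2):
        return ordered  # small group: rank directly by delta
    # Keep both parts non-empty so the recursion always terminates.
    n_leaders = min(max(1, math.ceil(alpha * len(ordered))), len(ordered) - 1)
    leaders, followers = ordered[:n_leaders], ordered[n_leaders:]
    # Followers are placed after leaders when merging.
    return (leader_follower_order(leaders, edges, alpha, min_group)
            + leader_follower_order(followers, edges, alpha, min_group))

def rank_feature(nodes, edges, **kwargs):
    """Rank position divided by the number of questions (0.0 = top leader)."""
    order = leader_follower_order(nodes, edges, **kwargs)
    return {v: pos / len(order) for pos, v in enumerate(order)}

nodes = ["q0", "q1", "q2", "q3"]
edges = [(f"q{i}", f"q{j}") for i in range(4) for j in range(i + 1, 4)]
features = rank_feature(nodes, edges)
```

In this toy network every edge points from a lower-numbered to a higher-numbered question, so `q3` ends up as the top leader and `q0` as the last follower.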
In our supervised model, for each directed edge () ( is more difficult than ), we consider () as an entity in the positive class (class 1) and () as an entity in the negative class (class 2). Therefore, in the training set the sizes of class 1 and class 2 are the same and equal to the number of directed edges in the overall network. We use different types of classifiers, namely SVM, Decision Tree, Naive Bayes, K Nearest Neighbors and Multilayer Perceptron; among them SVM turns out to be the best model (Table 5(c)).
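The construction of the balanced training set can be sketched as follows. The sketch assumes each edge is stored as an (easier, harder) tuple and that per-question feature vectors are already available in a dictionary; concatenating the two vectors to represent an ordered pair is an assumption made here for illustration.

```python
def build_training_set(edges, features):
    """Turn each directed edge into one class-1 and one class-2 example."""
    X, y = [], []
    for easier, harder in edges:
        # The pair in its original edge order is a class-1 example.
        X.append(features[easier] + features[harder])
        y.append(1)
        # The reversed pair is a class-2 example.
        X.append(features[harder] + features[easier])
        y.append(2)
    return X, y

features = {"q1": [0.1, 0.2], "q2": [0.3, 0.4]}
X, y = build_training_set([("q1", "q2")], features)
```

Both classes always have exactly as many examples as there are edges, so no re-balancing is needed. The resulting `X` and `y` can be passed to any standard classifier, for example an RBF-kernel SVM.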
Time Complexity: Let denote the number of questions, be a constant which depends on textual content and denote the number of edges. Both the hypotheses, H1 and H2, take times while H3 is computed in . For the different features we used, leader-follower Ranking takes time, PageRank takes time, degree takes time, Posting time difference between question and its accepted answer takes time, Accepted answers of users who posted question takes time and textual feature of Question takes time. Thus, the overall time complexity turns out to be .
4.5. Transitive Relationship
An important issue is the transitive relations among questions whose difficulties are determined in a pairwise manner. Let , , and be three questions, and let DiffQue produce the relative difficulty as . Transitivity would then imply that ; whereas non-transitivity condition implies that which in turn creates cycles in the underlying graph.
To test the presence of transitive vs non-transitive cases, we set out to find non-transitive cases first, i.e., cases where cycle (of different length) exists. We randomly pick a set of nodes 10,000 times, where indicates the size of the cycle we want to investigate. For each set, we check if the cycle exists in any permutation of chosen nodes. We repeat the experiment by varying from to . We observed that for any value of , in less than of cases, the cycle exists and in more than cases transitivity holds. Our claim turns out to be statistically significant with .
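The sampling experiment above can be sketched as follows. Here `harder_than(a, b)` is a hypothetical callable standing in for DiffQue's pairwise prediction (true when `a` is predicted to be more difficult than `b`); the function names and the seed are likewise illustrative.

```python
import random
from itertools import permutations

def has_cycle(nodes, harder_than):
    """True if some ordering of `nodes` forms a directed cycle of length len(nodes)."""
    first, rest = nodes[0], nodes[1:]
    # Cycles are rotation-invariant, so the first node can be fixed.
    for perm in permutations(rest):
        cycle = (first,) + perm
        if all(harder_than(cycle[i], cycle[(i + 1) % len(cycle)])
               for i in range(len(cycle))):
            return True
    return False

def cycle_fraction(questions, harder_than, k, trials=10_000, seed=0):
    """Fraction of random k-node samples that contain a non-transitive cycle."""
    rng = random.Random(seed)
    hits = sum(has_cycle(rng.sample(questions, k), harder_than)
               for _ in range(trials))
    return hits / trials

# A comparator induced by a total order never yields a cycle.
fraction = cycle_fraction(list(range(20)), lambda a, b: a > b, k=3, trials=200)
```

The cost of `has_cycle` grows factorially with the cycle length, so this brute-force check is only practical for small cycles.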
5. Experimental Results
This section presents the performance of the competing methods. It starts by briefly describing the human annotation for test set generation, followed by the baseline methods. It then elaborates on the parameter selection process for DiffQue, the comparative evaluation of the competing methods and other important properties of DiffQue. All computations were carried out on a single Ubuntu machine with GB RAM and GHz CPU with cores.
5.1. Test Set Generation
For each dataset mentioned in Section 3, we prepared a test set for evaluating the competing models as follows. Three experts (all of whom also helped us validate our hypotheses, as discussed in Section 4.3) were given 1000 question pairs to start the annotation process. While annotating, experts were shown pairs of questions along with their descriptions. We took care to keep metadata (such as information about the questioners, upvote counts of answers, etc.) hidden. All three experts were told to separately mark each pair of questions (, ) in terms of their relative difficulty: is tougher than , is easier than or difficult to predict. The moderator stopped the annotation process once pairs were annotated for each dataset such that all three annotators agreed upon their annotations (i.e., complete consensus was achieved for each pair). The annotators ended up annotating , , and pairs to reach a consensus on pairs for SO1, SO2, SO3 and MSE respectively. The inter-annotator agreement using Fleiss’ kappa is . We used total such annotated pairs from four datasets for comparative evaluation.
5.2. Baseline Models and Evaluation Measures
We consider four baselines described below; first three are existing methods (see Section 2.2) and the last one is designed by us:
Trueskill: The approach was proposed by Liu et al. (Liu et al., 2013). It creates a user network which takes into account the question (treated as a pseudo user), the question asker, the best answerer and each of the non-best answerers. For each question, it creates four competitions/pairs: one between pseudo user U and asker U, one between pseudo user U and the best answerer U, one between the best answerer U and asker U and one between the best answerer U and each of the non-best answerers of the question Q. This creates a user model, and all the competitions are fed to TrueSkill, a Bayesian skill rating model which calculates the rank of each node in the model. For a given pair of questions, the question with the higher rank (generated by Trueskill) is considered more difficult than the question with the lower rank.
RCM: The Regularized Competition Model (RCM) was proposed by Wang et al. (2014a). It uses the network construction by (Liu et al., 2013). It addresses the major problem of data sparseness, i.e., the low count of edges between questions present in (Liu et al., 2013). It assigns a difficulty to every question and user. To overcome the sparsity problem, it further uses the textual similarity of question descriptions (title, body, tags) to infer the appropriateness of the assigned difficulty, so that a pair of textually similar questions should be of similar difficulty. It then proposes a regularized competition model to assign a difficulty score. The algorithm tries to minimize loss on two fronts. First, if there is an edge in the network between two nodes, then there should be at least difference between their difficulties. Second, if two questions are similar as per their textual descriptions, then the difference in their difficulties should be minimum. It forms a vector where initial entries are expertise of users and next entries are difficulty of questions. A competition set is defined as a set consisting of all edges in the network. The algorithm essentially tries to minimize: where (i,j) denotes an edge in network and . It also tries to minimize for every question pair (i,j) where indicates similarity between question description. The algorithm starts at initial and proceeds towards negative subgradient, = - *(), where () is the subgradient and is set as 0.001.
PageRank: The approach was proposed by Yang et al. (2008). It exploits the concept of acceptance of answer in CQA Services in constructing the model. It constructs the model in a way such that there is a directed edge from the question(s) where a particular user’s answer is accepted to each question where his/her answer is not accepted. Suppose user B answers questions Q1, Q2 and Q3 wherein answer to Q1 is only accepted. Then there will be a directed edge from Q2 to Q1, and Q3 to Q1. PageRank is then applied on the generated graph to get a score for each node (question). For a given pair of questions, a question with higher PageRank score indicates a higher difficulty level than the question with lower PageRank score.
HITS: We further propose a new baseline as follows – we run HITS algorithm (Kleinberg, 1999b) on our network and rank the questions globally based on their authoritativeness. We use authority score instead of hub score to evaluate the difficulty of question based on the node itself. The importance of links is captured in construction of network. Now given a pair of questions, we mark the one as more difficult whose authoritativeness is higher.
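The HITS baseline can be illustrated with a small power-iteration sketch. It assumes the network is given as a list of (easier, harder) edges; the iteration count and the tie-breaking rule are arbitrary choices of this sketch.

```python
import math

def hits_authority(nodes, edges, iterations=50):
    """Return the authority score of every node after a fixed number of HITS updates."""
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of nodes pointing to v.
        new_auth = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        norm = math.sqrt(sum(a * a for a in new_auth.values())) or 1.0
        auth = {v: a / norm for v, a in new_auth.items()}
        # Hub update: sum of authority scores of nodes that v points to.
        new_hub = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = math.sqrt(sum(h * h for h in new_hub.values())) or 1.0
        hub = {v: h / norm for v, h in new_hub.items()}
    return auth

def more_difficult(q1, q2, auth):
    """Pick the question with the higher authority score (ties go to q1)."""
    return q1 if auth[q1] >= auth[q2] else q2

auth = hits_authority(["a", "b", "c"], [("a", "c"), ("b", "c"), ("a", "b")])
harder = more_difficult("a", "c", auth)
```

Because edges point towards the harder question, a question that many other questions point to accumulates authority and is therefore ranked as more difficult.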
We measure the accuracy of the methods based on F1 score and Area under the ROC curve (AUC).
5.3. Parameter Selection for DiffQue
There are three important parameters of DiffQue: (i) bucket size for determining question posting time, (ii) recency of questions for type 2 edges () and (iii) classifier for edge directionality prediction. Table 5 shows the F1 score of DiffQue when varying the parameter values on the SO3 dataset (the pattern was the same for the other datasets and is therefore not reported here). We vary the bucket size from to and observe that -week bucket interval produces the best result. is varied from to and gives the highest accuracy. We use seven classifiers: SVM, Decision Tree (DT), Naive Bayes (NB), K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), Random Forest (RF), and Gradient Boosting (GB) with suitable hyper-parameter optimization; among them SVM (with RBF kernel) produces the best accuracy. Therefore, unless otherwise stated, we use the following parameters as default for DiffQue: bucket size=2, and SVM classifier.
|(a) Bucket size|
|Dataset||Underlying||F1 score (%)||AUC (%)|
5.4. Comparative Evaluation
Table 6 shows the accuracy of all the competing methods across different datasets (see the boldface values only). Irrespective of the dataset and the evaluation measure, DiffQue outperforms all the baselines by a significant margin. DiffQue achieves a consistent performance of more than 70% F1 score and 70% AUC across different datasets. However, no single baseline is consistently the best across all the datasets (see the red numbers in Table 6). DiffQue beats the best baseline by 28.77% (28.72%), 22.79% (23.03%), 46.16% (44.92%) and 21.90% (22.54%) higher F1 score (AUC) for SO1, SO2, SO3 and MSE datasets respectively. We also observe that only in SO3 does Trueskill beat RCM. Trueskill is a Bayesian skill rating model, where the skill level of a pseudo-user (based upon the network construction) is modeled by a normal distribution and updated according to Bayes’ theorem. SO3, the Stack Overflow data corresponding to the most recent time frame, contains the largest amount of data in terms of both questions and users. Therefore, the normal distribution may be able to capture the skill level of pseudo-users better than the optimization problem of RCM.
Since all the baselines first create a network and then run their difficulty estimation modules on that network, one may ask how effective our network construction mechanism is compared to the other mechanisms. To check this, we further consider each such network (obtained from each competing method) separately, run the difficulty estimation module suggested by each competing method and report the accuracy in Table 6. For instance, the value reported in third row and first column (i.e., ) indicates the accuracy of RCM on the network constructed by PageRank. We once again notice that in most cases, the competing methods perform the best with DiffQue’s network. This indicates the superiority of our network construction mechanism.
We also compare DiffQue with two learning-to-rank methods, RankBoost (Freund et al., 2003) and RankSVM (Lee and Lin, 2014; Joachims, 2002; Herbrich et al., 1999), and with modified versions of PageRank and HITS (we call them Modified PageRank (Kumar et al., 2011) and Modified HITS (Liu et al., 2012) respectively). The performance of these methods in terms of F1 and AUC is reported in Table 7. Once again, we observe that DiffQue significantly outperforms all these baselines across all the datasets.
|Modified PageRank||F1 score||62.53||58.25||58.90||60.52|
|Modified HITS||F1 score||61.46||60.54||59.81||61.21|
5.5. Feature and Hypothesis Importance
We also measure the importance of each feature for DiffQue with default configuration. For this, we drop each feature in isolation and measure the accuracy of DiffQue. Figure 4(a) shows that the maximum decrease in accuracy (27.63% decrease in F1) is observed when we drop leader follower ranking (F1 and F2), followed by PageRank (F3 and F4) and degree (F5 and F6). However, there is no increase in accuracy if we drop any feature, indicating that all features should be considered for this task.
One may further question the importance of keeping all hypotheses (all edge types in the network). To examine this, we conduct a similar experiment: we drop each edge type (by ignoring its corresponding hypothesis), reconstruct the network and measure the accuracy of DiffQue. Figure 4(b) shows a similar trend: dropping any edge type decreases the accuracy, which indicates that all types of edges are important. However, edge type 2 (Hypothesis 2) seems to be the most important (dropping it reduces the F1 score by 26.31%), followed by type 1 and type 3 edges.
5.6. Robustness under Noise
We further test the robustness of DiffQue by injecting two types of noise into the network: (i) Noise 1, where we keep inserting of the existing edges randomly into the network, thus increasing the size of the training set, and (ii) Noise 2, where we first remove of the existing edges and randomly insert of the edges into the network, so that the size of the training set remains the same after injecting the noise. We vary from to (with the increment of ). We hypothesize that a robust method should be able to tolerate the effect of noise up to a certain level, after which its performance should start deteriorating. Figure 5 shows the effect of both types of noise on the performance of DiffQue, RCM (the best baseline), and Trueskill. DiffQue performs almost the same until 5% of either type of noise is injected into the training set. However, the behavior of RCM is quite abnormal: sometimes its performance improves as the noise level increases (Figure 5(a)). Trueskill follows the same pattern as RCM. Nevertheless, we acknowledge that DiffQue is not sufficiently resilient to noise, as it does not tolerate even a small amount of noise (which leaves scope for further improvement); however, its performance does not behave abnormally.
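The two noise schemes can be sketched as follows. The sketch reads Noise 1 as adding random edges whose number equals a given percentage of the existing edge count, which is one interpretation of the description above; the function name, the `replace` flag and the seed are illustrative.

```python
import random

def inject_noise(edges, nodes, percent, replace=False, seed=0):
    """Return a noisy copy of `edges`.

    replace=False -> Noise 1: add random edges (the edge set grows).
    replace=True  -> Noise 2: drop the same number of edges first (size is preserved).
    """
    rng = random.Random(seed)
    noisy = list(edges)
    k = round(len(edges) * percent / 100)
    if replace:
        # Remove k randomly chosen existing edges.
        rng.shuffle(noisy)
        noisy = noisy[k:]
    for _ in range(k):
        # Random direction between two distinct nodes; may duplicate an existing edge.
        src, dst = rng.sample(nodes, 2)
        noisy.append((src, dst))
    return noisy

nodes = list(range(41))
edges = [(i, i + 1) for i in range(40)]
noise1 = inject_noise(edges, nodes, percent=10)
noise2 = inject_noise(edges, nodes, percent=10, replace=True)
```

With 40 edges and 10% noise, Noise 1 yields 44 edges while Noise 2 keeps the size at 40.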
5.7. Capability of Domain Adaptation
Another signature of a robust system is that it should be capable of adapting to different domains, i.e., it should be possible to train it on one dataset and apply it to another. To verify this, we train DiffQue and RCM (the best baseline) on one dataset and test them on another dataset. We only consider those users who are common to both training and test sets. For SO datasets, this is fairly straightforward; however, when we choose cross-domain data for training and testing (MSE as training and SO as testing and vice versa), we use the ‘account id’ of users as a key to link common users (SO and MSE both come under Stack Exchange (SE) (https://stackexchange.com/); each user therefore maintains a common ‘account id’ to participate in all CQA services under SE). Table 8 shows that the performance of DiffQue remains almost the same even when the domains of the training and test sets are different.
Interestingly, we observe that DiffQue performs better when SO1 and SO2 are used for training and SO3 for testing than when both training and testing are done on SO3. The reason may be as follows. In SO3, the number of type 1 (related to hypothesis 1) and type 2 (related to hypothesis 2) edges is less than that in SO1 and SO2. Note that, as shown in Figure 4(b), hypotheses 1 and 2 are the most important hypotheses. Therefore, the lack of two crucial edge types in SO3 may not allow the model to be trained properly. This may be the reason that our model does not perform better when trained on SO3.
Table 9 further shows a comparison of DiffQue with RCM, which also claims to support domain adaptation (Liu et al., 2013). As the network topology and metadata have a significant impact on question difficulty, we observe that such factors are domain independent, and our algorithm captures them well and thus can be adapted across domains. Tables 10, 11 and 12 show the domain adaptation capability of three baselines: Trueskill, PageRank and HITS respectively.
5.8. Handling Cold Start Problem
Most of the previous research (Yang et al., 2008; Liu et al., 2013) deals with well-resolved questions which have received enough attention in terms of the number of answers, upvotes etc. However, these methods suffer from the ‘cold start’ problem: they cannot efficiently handle newly posted questions with no answer and/or questions whose askers are new to the community. In many real-world applications such as question routing and incentive mechanism design, however, it is usually required that the difficulty level of a question is known instantly after it is posted (Wang et al., 2014a).
The RCM model leverages textual content to handle the cold-start problem. It uses boolean term weighting to represent each question as a feature vector. To determine similar questions, RCM measures the Jaccard coefficient between feature vectors and selects the top K nearest neighbors. DiffQue also handles the cold start problem by exploiting the textual description of questions. In our network construction, a question asked by a completely new user forms an isolated node; therefore none of the network related features (such as F1-F6) can be computed for that question. Moreover, if it does not receive any accepted answer, other features (F7-F12) cannot be measured. We call those questions “brand-new” for which none of the features can be computed, leading to an extreme level of cold start. Given a question pair () under inspection, there can be two types of cold start scenarios: (i) both and are brand-new, (ii) either or is brand-new. To handle the former, we run Doc2Vec (Le and Mikolov, 2014), a standard embedding technique, on the textual description of all the questions and return, for each of and , most similar questions (based on cosine similarity) which are not brand-new. For (resp. ), the most similar questions are denoted as (resp. ). We treat each such set as the representative of the corresponding brand-new question. We then run DiffQue to measure the relative difficulty of each pair (one taken from and the other taken from ). Finally, we consider that brand-new question to be more difficult whose corresponding neighboring set is marked as difficult the maximum number of times by DiffQue. The latter cold start scenario is handled in a similar way. Let us assume that is old and is brand-new. For , we first extract most similar questions using the method discussed above. We then run DiffQue to measure the relative difficulty between and each of , and mark the one as difficult which wins the most.
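The neighbor-voting procedure for both cold start scenarios can be sketched as follows. The sketch assumes that embeddings (for example from Doc2Vec) are already available as plain lists of floats, and that `harder_than(x, y)` is a hypothetical callable wrapping DiffQue's pairwise prediction for two questions that are not brand-new; the neighbor count `k` and the tie-breaking rules are placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_known(query_vec, known_vecs, k):
    """The k questions (not brand-new) most similar to the query embedding."""
    ranked = sorted(known_vecs, key=lambda q: cosine(query_vec, known_vecs[q]),
                    reverse=True)
    return ranked[:k]

def compare_brand_new(vec_a, vec_b, known_vecs, harder_than, k=5):
    """Scenario (i): both questions are brand-new; vote over all neighbor pairs."""
    na = nearest_known(vec_a, known_vecs, k)
    nb = nearest_known(vec_b, known_vecs, k)
    wins_a = sum(harder_than(x, y) for x in na for y in nb if x != y)
    wins_b = sum(harder_than(y, x) for x in na for y in nb if x != y)
    return "a" if wins_a >= wins_b else "b"  # ties go to "a" (arbitrary)

def compare_old_vs_new(old_q, vec_new, known_vecs, harder_than, k=5):
    """Scenario (ii): compare an old question against the new one's neighbors."""
    neighbours = [q for q in nearest_known(vec_new, known_vecs, k) if q != old_q]
    wins_old = sum(harder_than(old_q, q) for q in neighbours)
    return "old" if wins_old > len(neighbours) - wins_old else "new"

known = {"e1": [1, 0], "e2": [0.9, 0.1], "h1": [0, 1], "h2": [0.1, 0.9]}
difficulty = {"e1": 1, "e2": 2, "h1": 8, "h2": 9}  # toy ground truth
harder = lambda x, y: difficulty[x] > difficulty[y]
winner = compare_brand_new([0.05, 1], [1, 0.05], known, harder, k=2)
```

In the toy data the first brand-new question lies close to the two hard questions, so its neighbor set wins every pairwise comparison.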
To test the efficiency of our cold start module, we remove annotated pairs (edges and their associated nodes) randomly from the network. These pairs form the test set for the cold start problem. Figure 6 shows that with the increase of , DiffQue always performs better than RCM (the only baseline which handles cold start problem (Wang et al., 2014a)), indicating DiffQue’s superiority in tackling cold start problem.
6. System Description
We have designed an experimental version of DiffQue. Figure 7 shows the user interface of DiffQue. On the site, users enter the links of two questions and press the ‘Submit’ button. The system then shows which question is more difficult. The current implementation only accepts questions from Stack Overflow. However, it will be further extended to other (and across) CQA services.
DiffQue also supports online learning. A ‘Reject’ button is associated with each answer DiffQue produces; clicking it forwards the feedback to the back-end server. DiffQue can be trained incrementally with the feedback provided by the users (Shilton et al., 2005). However, special care has been taken to make the learning process robust by ignoring spurious feedback. An attacker may want to pollute DiffQue by injecting wrong feedback. DiffQue handles such spurious feedback by first checking the confidence of the current model on the labels associated with the feedback, and ignoring the feedback if the confidence is more than a pre-selected threshold (currently set as 0.75). This makes DiffQue more robust under adversarial attacks.
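The feedback filter can be illustrated with a short sketch. It assumes that each ‘Reject’ event carries the question pair and the label the model predicted, and that `confidence_of` is a hypothetical callable returning the current model's confidence in that label; how the deployed system computes this confidence is not specified above.

```python
CONFIDENCE_THRESHOLD = 0.75  # threshold stated in the text

def filter_feedback(feedback, confidence_of, threshold=CONFIDENCE_THRESHOLD):
    """Keep only rejections of predictions the model was not confident about.

    Returns (pair, corrected_label) tuples usable for incremental training.
    """
    flipped = {1: 2, 2: 1}
    kept = []
    for pair, predicted_label in feedback:
        if confidence_of(pair, predicted_label) > threshold:
            continue  # model is confident: treat the rejection as spurious
        kept.append((pair, flipped[predicted_label]))
    return kept

confidence = {("q1", "q2"): 0.9, ("q3", "q4"): 0.6}
feedback = [(("q1", "q2"), 1), (("q3", "q4"), 2)]
accepted = filter_feedback(feedback, lambda pair, label: confidence[pair])
```

Only the rejection of the low-confidence prediction survives the filter and is converted into a corrected training label.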
7. Estimating Overall Difficulty
One may further be interested in assigning a global difficulty score to each question, instead of predicting which of two questions is more difficult. In this case, each question could be labeled with one of several difficulty levels (e.g., easy, medium, hard). We argue that solving this problem is computationally challenging as it may lack sufficient training samples, and it also needs an understanding of the difficulty levels of every pair of questions. Even if we design such a system, its evaluation would be extremely challenging: it is relatively easy for a human annotator to find the more difficult question in a pair compared to judging the overall difficulty score of each question.
However, here we attempt to design such a system which can label each question as ‘easy’, ‘medium’ or ‘hard’. Our hypothetical system would construct the complete network (ignoring the directionality) such that every node pair is connected by an undirected edge, and the directionality of every edge is predicted by DiffQue. However, for a complete network with nodes, there are edges which, for a large value of , would require huge memory to store. We therefore approximate this procedure by picking a random pair of nodes at a time and predicting the directionality of the edge connecting them using DiffQue. We repeat this process times to construct a semi-complete network as permitted by the memory limitation of subsequent computations. We then compute PageRank of nodes on the constructed network.
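The approximation above can be sketched as follows. `harder_than(a, b)` is again a hypothetical stand-in for DiffQue's pairwise prediction; the plain PageRank implementation, the damping factor of 0.85 and the iteration count are assumptions of this sketch rather than reported settings.

```python
import random

def sample_network(questions, harder_than, n_pairs, seed=0):
    """Build a semi-complete directed network from randomly sampled pairs."""
    rng = random.Random(seed)
    edges = set()
    for _ in range(n_pairs):
        a, b = rng.sample(questions, 2)
        # Edges point from the easier question to the harder one.
        edges.add((b, a) if harder_than(a, b) else (a, b))
    return edges

def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling mass is spread uniformly."""
    n = len(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for dst in out[v]:
                new[dst] += damping * rank[v] / len(out[v])
        rank = new
    return rank

questions = list(range(6))
edges = sample_network(questions, lambda a, b: a > b, n_pairs=200)
scores = pagerank(questions, edges)
```

Questions that are predicted to be harder in many sampled pairs receive many incoming edges and hence a higher score, which can then be thresholded into difficulty levels.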
Estimating the difficulty level from the PageRank value requires suitable thresholding. For this, we chose a set of questions randomly from the SO3 dataset and asked human annotators to annotate their difficulty level. The inter-annotator agreement measured by Fleiss’ kappa is low, which once again implies that even human annotators showed strong disagreement in labeling the global difficulty level of questions. For further processing, we kept only those questions for which at least two annotators agreed on the difficulty level. Experiments are conducted using 10-fold cross validation, where the training set is used for determining the thresholds. Table 13 shows that DiffQue outperforms other baselines for all difficulty levels.
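One simple way to determine such thresholds from a training fold is an exhaustive search over candidate cut-offs, sketched below. The text does not specify how the thresholds are chosen, so the search strategy, the function names and the toy scores are all assumptions; the sketch also assumes that a higher PageRank value corresponds to a harder question.

```python
from itertools import combinations_with_replacement
from typing import Sequence, Tuple


def label_from_score(score: float, low: float, high: float) -> str:
    """Map a PageRank score to a difficulty level using two cut-offs."""
    if score < low:
        return "easy"
    if score < high:
        return "medium"
    return "hard"


def fit_thresholds(scores: Sequence[float], labels: Sequence[str]) -> Tuple[float, float]:
    """Return the (low, high) cut-offs with the most correct training labels.

    Candidate cut-offs are the observed scores plus infinity, so that a
    level is allowed to be empty.
    """
    candidates = sorted(set(scores)) + [float("inf")]
    best, best_correct = (candidates[0], candidates[0]), -1
    for low, high in combinations_with_replacement(candidates, 2):
        correct = sum(
            label_from_score(s, low, high) == y for s, y in zip(scores, labels)
        )
        if correct > best_correct:
            best, best_correct = (low, high), correct
    return best


# Toy training fold (assumed values, not taken from the paper).
train_scores = [0.01, 0.02, 0.05, 0.06, 0.20, 0.30]
train_labels = ["easy", "easy", "medium", "medium", "hard", "hard"]
low, high = fit_thresholds(train_scores, train_labels)
```

The fitted cut-offs would then be applied to the held-out fold with `label_from_score`.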
In this paper, we proposed DiffQue to address the problem of estimating the relative difficulty of a pair of questions in CQA services. DiffQue leverages a novel network structure and estimates the relative difficulty of questions by running a supervised edge directionality prediction model. DiffQue turned out to be more effective than four state-of-the-art baselines w.r.t. accuracy, robustness and the capability of handling the cold start problem. DiffQue can further be used to obtain an overall ranking of all questions based on their difficulty level. We have made the code and dataset available for reproducibility.
The work was partially funded by DST, India (ECR/2017/001691) and the Ramanujan Fellowship. The authors also acknowledge the support of the Infosys Centre of AI, IIIT-Delhi, India.
- Arumae et al. (2018) Kristjan Arumae, Guo-Jun Qi, and Fei Liu. 2018. A Study of Question Effectiveness Using Reddit “Ask Me Anything” Threads. arXiv preprint arXiv:1805.10389 (2018).
- Baeza-Yates et al. (2011) Ricardo Baeza-Yates, Berthier de Araújo Neto Ribeiro, et al. 2011. Modern information retrieval. New York: ACM Press; Harlow, England: Addison-Wesley.
- Bouguessa et al. (2008) Mohamed Bouguessa, Benoît Dumoulin, and Shengrui Wang. 2008. Identifying Authoritative Actors in Question-answering Forums: The Case of Yahoo! Answers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’08). 866–874.
- Burges et al. (2005) Christopher Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine learning (ICML-05). 89–96.
- Burges et al. (2007) Christopher J Burges, Robert Ragno, and Quoc V Le. 2007. Learning to rank with nonsmooth cost functions. In Advances in neural information processing systems. 193–200.
- Campbell and Treude (2017) Brock Angus Campbell and Christoph Treude. 2017. NLP2Code: Code Snippet Content Assist via Natural Language Tasks. arXiv preprint arXiv:1701.05648 (2017).
- Cao et al. (2007) Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning. ACM, 129–136.
- Cao et al. (2018) Zherui Cao, Yuan Tian, Tien-Duy B Le, and David Lo. 2018. Rule-based specification mining leveraging learning to rank. Automated Software Engineering (2018), 1–30.
- Cardon et al. (2011) Dominique Cardon, Guilhem Fouetillou, and Camille Roth. 2011. Two Paths of Glory-Structural Positions and Trajectories of Websites within Their Topical Territory. In ICWSM. 58–65.
- Chang and Pal (2013) Shuo Chang and Aditya Pal. 2013. Routing questions for collaborative answering in community question answering. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ACM, 494–501.
- Chen et al. (2017) Chunyang Chen, Zhenchang Xing, and Yang Liu. 2017. By the Community & For the Community: A Deep Learning Approach to Assist Collaborative Editing in Q&A Sites. Proc. ACM Hum.-Comput. Interact. 1, CSCW (2017), 32:1–32:21.
- Cheng et al. (2018) Minhao Cheng, Ian Davidson, and Cho-Jui Hsieh. 2018. Extreme Learning to Rank via Low Rank Assumption. In International Conference on Machine Learning. 950–959.
- Chowdhury and Chakraborty (2019) Tanya Chowdhury and Tanmoy Chakraborty. 2019. CQASUMM: Building References for Community Question Answering Summarization Corpora. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD ’19). ACM, New York, NY, USA, 18–26. https://doi.org/10.1145/3297001.3297004
- Chowdhury et al. (2018) Tanya Chowdhury, Aashay Mittal, and Tanmoy Chakraborty. 2018. VIZ-Wiki: Generating Visual Summaries to Factoid Threads in Community Question Answering Services. In Companion of The Web Conference 2018. International World Wide Web Conferences Steering Committee, 231–234.
- Chu and Ghahramani (2005a) Wei Chu and Zoubin Ghahramani. 2005a. Gaussian processes for ordinal regression. Journal of machine learning research 6, Jul (2005), 1019–1041.
- Chu and Ghahramani (2005b) Wei Chu and Zoubin Ghahramani. 2005b. Preference learning with Gaussian processes. In Proceedings of the 22nd international conference on Machine learning. ACM, 137–144.
- de Souza et al. (2014) Lucas BL de Souza, Eduardo C Campos, and Marcelo de A Maia. 2014. Ranking crowd knowledge to assist software development. In ACM ICPC. 72–82.
- Deerwester et al. (1990) Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American society for information science 41, 6 (1990), 391–407.
- Fattah et al. (2007) Salmah Fattah, Siti Hasnah Tanalol, and Mazlina Mamat. 2007. Classification of Examination Questions Difficulty Level Based on Bloom’s Taxonomy. In Regional Conference on Computational Science and Technology. 204–207.
- Freund et al. (2003) Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. 2003. An efficient boosting algorithm for combining preferences. Journal of machine learning research 4, Nov (2003), 933–969.
- Grewal and Grewal (2001) B.S. Grewal and J.S. Grewal. 2001. Higher Engineering Mathematics. Khanna Publishers. https://books.google.co.in/books?id=73CJtwAACAAJ
- Guo et al. (2013) Fangjian Guo, Zimo Yang, and Tao Zhou. 2013. Predicting link directions via a recursive subgraph-based ranking. Physica A 392, 16 (2013), 3402–3408.
- Guo et al. (2009) Lei Guo, Enhua Tan, Songqing Chen, Xiaodong Zhang, and Yihong Eric Zhao. 2009. Analyzing patterns of user content generation in online social networks. In ACM SIGKDD. 369–378.
- Gyöngyi et al. (2004) Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. 2004. Combating web spam with trustrank. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30. VLDB Endowment, 576–587.
- Hassan et al. ([n.d.]) Sk Adnan Hassan, Dipto Das, Anindya Iqbal, Amiangshu Bosu, Rifat Shahriyar, and Toufique Ahmed. [n.d.]. SOQDE: A Supervised Learning based Question Difficulty Estimation Model for Stack Overflow. ([n. d.]).
- He et al. (2018) Kun He, Fatih Cakir, Sarah Adel Bargal, and Stan Sclaroff. 2018. Hashing as tie-aware learning to rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4023–4032.
- Herbrich et al. (1999) Ralf Herbrich, Thore Graepel, and Klaus Obermayer. 1999. Support vector learning for ordinal regression. (1999).
- Hong and Shen (2009) Dan Hong and Vincent Y Shen. 2009. Online user activities discovery based on time dependent data. In IEEE CSE, Vol. 4. 106–113.
- Horstmann (2016a) Cay S. Horstmann. 2016a. Core Java Volume I—Fundamentals. Prentice Hall. https://www.safaribooksonline.com/library/view/core-java-volume/9780134177335/
- Horstmann (2016b) Cay S. Horstmann. 2016b. Core Java Volume II—Advanced Features. Prentice Hall. https://www.safaribooksonline.com/library/view/core-java-volume/9780134177878/
- Hu et al. (2015) Weishu Hu, Haitao Zou, and Zhiguo Gong. 2015. Temporal pagerank on social networks. In WISE. 262–276.
- Jeon et al. (2005) Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. 2005. Finding Similar Questions in Large Question and Answer Archives. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM ’05). 84–90.
- Joachims (2002) Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 133–142.
- Kendal (2009) S. Kendal. 2009. Object Oriented Programming using Java. Ventus Publishing ApS. https://books.google.co.in/books?id=HWC4ZjXs1V8C
- Kleinberg (1999a) Jon M Kleinberg. 1999a. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM) 46, 5 (1999), 604–632.
- Kleinberg (1999b) Jon M Kleinberg. 1999b. Authoritative sources in a hyperlinked environment. JACM 46, 5 (1999), 604–632.
- Kreyszig et al. (2011) Erwin Kreyszig, Herbert Kreyszig, and E. J. Norminton. 2011. Advanced Engineering Mathematics. Wiley.
- Kumar et al. (2011) Gyanendra Kumar, Neelam Duhan, and AK Sharma. 2011. Page ranking based on number of visits of links of web page. In 2nd International Conference on Computer and Communication Technology. IEEE, 11–14.
- Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML. 1188–1196.
- Lee and Lin (2014) Ching-Pei Lee and Chih-Jen Lin. 2014. Large-scale linear ranksvm. Neural computation 26, 4 (2014), 781–817.
- Lempel and Moran (2001) Ronny Lempel and Shlomo Moran. 2001. SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems (TOIS) 19, 2 (2001), 131–160.
- Liu et al. (2011) Jing Liu, Young-In Song, and Chin-Yew Lin. 2011. Competition-based user expertise score estimation. In ACM SIGIR. 425–434.
- Liu et al. (2013) Jing Liu, Quan Wang, Chin-Yew Lin, and Hsiao-Wuen Hon. 2013. Question Difficulty Estimation in Community Question Answering Services. In EMNLP. ACL, 85–90.
- Liu et al. (2009) Tie-Yan Liu et al. 2009. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3, 3 (2009), 225–331.
- Liu et al. (2012) Xinyue Liu, Hongfei Lin, and Cong Zhang. 2012. An Improved HITS Algorithm Based on Page-query Similarity and Page Popularity. JCP 7, 1 (2012), 130–134.
- Liu et al. (2018) Xialei Liu, Joost van de Weijer, and Andrew D Bagdanov. 2018. Leveraging unlabeled data for crowd counting by learning to rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7661–7669.
- Liu et al. (2008) Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, and Hang Li. 2008. BrowseRank: Letting Web Users Vote for Page Importance. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’08). 451–458.
- Lü and Zhou (2011) Linyuan Lü and Tao Zhou. 2011. Link prediction in complex networks: A survey. Physica A 390, 6 (2011), 1150–1170.
- Malewicz et al. (2010) Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A System for Large-scale Graph Processing. In SIGMOD. 135–146.
- Nasehi et al. (2012) Seyed Mehdi Nasehi, Jonathan Sillito, Frank Maurer, and Chris Burns. 2012. What makes a good code example?: A study of programming Q&A in StackOverflow. In ICSM. 25–34.
- Nguyen et al. (2018) Thanh Ngan Nguyen, Minh Trang Nguyen, and Thanh Hai Dang. 2018. Disease Named Entity Normalization Using Pairwise Learning To Rank and Deep Learning. Technical Report. VNU University of Engineering and Technology.
- Page et al. (1999) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report. Stanford InfoLab.
- Pahikkala et al. (2009) Tapio Pahikkala, Evgeni Tsivtsivadze, Antti Airola, Jouni Järvinen, and Jorma Boberg. 2009. An efficient algorithm for learning to rank from preference graphs. Machine Learning 75, 1 (2009), 129–165.
- Pal and Counts (2011) Aditya Pal and Scott Counts. 2011. Identifying Topical Authorities in Microblogs. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM ’11). 45–54.
- Pal et al. (2012) Aditya Pal, F. Maxwell Harper, and Joseph A. Konstan. 2012. Exploring Question Selection Bias to Identify Experts and Potential Experts in Community Question Answering. ACM Trans. Inf. Syst. (2012), 10:1–10:28.
- Pecinovsky (2013) R. Pecinovsky. 2013. OOP - Learn Object Oriented Thinking & Programming. Tomas Bruckner. https://books.google.co.in/books?id=xb-sAQAAQBAJ
- Qin et al. (2007) Tao Qin, Xu-Dong Zhang, De-Sheng Wang, Tie-Yan Liu, Wei Lai, and Hang Li. 2007. Ranking with multiple hyperplanes. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 279–286.
- Rijke (2019) Maarten de Rijke. 2019. Reinforcement Learning to Rank. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 5–5.
- Rosen (2003) K.H. Rosen. 2003. Discrete mathematics and its applications. McGraw-Hill.
- Ross (2000) Sheldon M. Ross. 2000. Introduction to probability and statistics for engineers and scientists (2. ed.). Academic Press. I–XIV, 1–578 pages.
- San Pedro and Karatzoglou (2014) Jose San Pedro and Alexandros Karatzoglou. 2014. Question Recommendation for Collaborative Question Answering Systems with RankSLDA. In ACM RecSys. 193–200.
- Sasaki et al. (2018) Shota Sasaki, Shuo Sun, Shigehiko Schamoni, Kevin Duh, and Kentaro Inui. 2018. Cross-lingual learning-to-rank with shared representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 458–463.
- Schildt (2014) H. Schildt. 2014. Java: The Complete Reference, Ninth Edition. McGraw-Hill Education. https://books.google.co.in/books?id=tvbamgEACAAJ
- Shilton et al. (2005) Alistair Shilton, Marimuthu Palaniswami, Daniel Ralph, and Ah Chung Tsoi. 2005. Incremental training of support vector machines. IEEE transactions on neural networks 16, 1 (2005), 114–131.
- Soundarajan and Hopcroft (2012) Sucheta Soundarajan and John E Hopcroft. 2012. Use of Supervised Learning to Predict Directionality of Links in a Network. In ADMA. 395–406.
- Stroud and Booth (2003) K.A. Stroud and D.J. Booth. 2003. Advanced Engineering Mathematics. Industrial Press. https://books.google.co.in/books?id=5LKXAAAACAAJ
- Tsai et al. (2007) Ming-Feng Tsai, Tie-Yan Liu, Tao Qin, Hsin-Hsi Chen, and Wei-Ying Ma. 2007. FRank: a ranking method with fidelity loss. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 383–390.
- Vasilescu et al. (2014) Bogdan Vasilescu, Alexander Serebrenik, Prem Devanbu, and Vladimir Filkov. 2014. How Social Q&A Sites Are Changing Knowledge Sharing in Open Source Software Communities. In ACM CSCW. New York, NY, USA, 342–354.
- Wang et al. (2014a) Quan Wang, Jing Liu, Bin Wang, and Li Guo. 2014a. A Regularized Competition Model for Question Difficulty Estimation in Community Question Answering Services. In EMNLP. ACL, 1115–1126.
- Wang et al. (2014b) Shaowei Wang, David Lo, Bogdan Vasilescu, and Alexander Serebrenik. 2014b. Entagrec: An enhanced tag recommendation system for software information sites. In IEEE ICSME. 291–300.
- Wang et al. (2015) Xiaojie Wang, Xue Zhang, Chengli Zhao, Zheng Xie, Shengjun Zhang, and Dongyun Yi. 2015. Predicting link directions using local directed path. Physica A 419 (2015), 260–267.
- Weng et al. (2010) Jianshu Weng, Ee-Peng Lim, Jing Jiang, and Qi He. 2010. TwitterRank: Finding Topic-sensitive Influential Twitterers. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. 261–270.
- Xue et al. (2008) Xiaobing Xue, Jiwoon Jeon, and W Bruce Croft. 2008. Retrieval models for question and answer archives. In ACM SIGIR. 475–482.
- Yang et al. (2008) Jiang Yang, Lada A Adamic, and Mark S Ackerman. 2008. Competing to Share Expertise: The Taskcn Knowledge Sharing Community. In ICWSM. 161–168.
- Yang and Leskovec (2011) Jaewon Yang and Jure Leskovec. 2011. Patterns of temporal variation in online media. In ACM WSDM. 177–186.
- Yang et al. (2013) Liu Yang, Minghui Qiu, Swapna Gottipati, Feida Zhu, Jing Jiang, Huiping Sun, and Zhong Chen. 2013. Cqarank: jointly model topics and expertise in community question answering. In ACM CIKM. 99–108.
- Zhai (2008) ChengXiang Zhai. 2008. Statistical language models for information retrieval. Synthesis Lectures on Human Language Technologies 1, 1 (2008), 1–141.
- Zhou et al. (2009) Yanhong Zhou, Gao Cong, Bin Cui, Christian S Jensen, and Junjie Yao. 2009. Routing questions to the right users in online communities. In IEEE ICDE. 700–711.