Traceability in the Wild:
Automatically Augmenting Incomplete Trace Links
Software and systems traceability is widely accepted as an essential element for supporting many software development tasks. Today’s version control systems provide inbuilt features that allow developers to tag each commit with one or more issue IDs, thereby providing the building blocks from which project-wide traceability can be established between feature requests, bug fixes, commits, source code, and specific developers. However, our analysis of six open source projects showed that on average only 60% of the commits were linked to specific issues. Without these fundamental links the entire set of project-wide links is incomplete, and therefore not trustworthy. In this paper we address the fundamental problem of missing links between commits and issues. Our approach leverages a combination of process and text-related features characterizing issues and code changes to train a classifier to identify missing issue tags in commit messages, thereby generating the missing links. We conducted a series of experiments to evaluate our approach against six open source projects and showed that it was able to effectively recommend links for tagging issues at an average of 96% recall and 33% precision. In a related task for augmenting a set of existing trace links, the classifier returned precision at levels greater than 89% in all projects and a recall of 50%.
Traceability provides support for many different software engineering activities including safety analysis, change impact analysis, test regression selection, and coverage analysis (Gotel et al., 2012; Gotel and Finkelstein, 1994; Ramesh and Jarke, 2001; Mäder and Cleland-Huang, 2013; Cleland-Huang et al., 2014; Mäder and Cleland-Huang, 2015). Its importance has long been recognized in safety-critical domains, where it is often a prescribed part of the development process (Cleland-Huang et al., 2012; Mäder et al., 2013; Rempel et al., 2014; Regan et al., 2014; Rempel and Mäder, 2016). While traceability is relevant to all software development environments (Ramesh and Jarke, 2001; Mäder et al., 2013; Mäder and Egyed, 2015; Mäder et al., 2017; Ståhl et al., 2017; Rempel and Mäder, 2017), the effort needed to manually establish and maintain trace links in non-regulated domains has often been perceived as prohibitively high.
However, with the ubiquitous adoption of version control systems such as Git (Git SCM, [n. d.]) and GitHub, and issue tracking systems such as Bugzilla or Jira (JIRA, [n. d.]), it has become common practice for developers to tag commits with issue IDs. In large projects, such as the ones from the Apache Foundation, this procedure is reflected in the guidelines which state that “You need to make sure that the commit message contains at least […] a reference to the Bugzilla or JIRA issue […]” (Foundation, [n. d.]). Creating such tags establishes explicit links between commits and issues, such as feature requests and bug reports. However, the process is not perfect, as developers may forget, or otherwise fail, to create tags when they make a commit (Romo et al., 2014; Bachmann and Bernstein, 2009). While the practice of tagging commits has become popular in open source projects, it is conceptually applicable in any project where version control systems and issue trackers are used.
In this paper we propose a solution for identifying tags that are missing between commits and issues and augmenting the traceability data with these previously missing links. As shown later in the paper, our observations across six open source projects showed that an average of only about 60% of commits were linked to specific issues. The majority of papers addressing traceability in OSS have focused on directly establishing a complete set of links between issues and source code. In contrast, we focus on generating the missing links at the commit level. This has the primary advantage of providing traceability support within the natural context in which developers are creating trace links. Our approach leverages existing tags, as well as information related to the commit process itself and textual similarities between commit messages, issue descriptions, and code changes. We use these attributes to train a classifier to identify tags that are missing from commit messages. Furthermore, we set a critical constraint on our work that the classifier must be populated, trained, and then utilized with a simple “button press” in order to make it practical in an industrial setting.
Low level links between commits and issues provide the building blocks for inferring project-wide traceability between improvements, bug reports, source code, test cases, and commits, and also allow associations to be established between the issues and developers (Seiler and Paech, 2017). Augmenting the set of trace links between commits and issues, therefore results in a more complete set of project-wide trace links. This enables more accurate support for tasks such as defect prevention (Rempel and Mäder, 2017), change impact analysis, coverage analysis, and even provides enhanced support for building recommendation systems to identify appropriate developers for fixing bugs (Anvik et al., 2006).
We train and evaluate our approach on six open-source projects in order to address three key research questions:
RQ1: Is the link classifier able to accurately reconstruct issue tags during the commit process?
RQ2: Is the link classifier able to precisely augment an existing set of incomplete commit to issue links in a fully automated way?
RQ3: Is the link classifier able to recommend additional tags?
The remainder of the paper is structured as follows. Section 2 introduces the artifacts, case projects, process model, and stakeholder model that form the fundamentals of our approach. Section 3 describes the elements of our classifier. Section 4 describes the six projects in our study. Sections 5 and 6 describe scenarios and experiments associated with recommending tags for commits, augmenting existing sets of trace links, and constructing trace links for commits with no tags. Finally, Sections 7 to 9 discuss related work, threats to validity, and conclusions.
We first introduce a motivating example and describe the artifacts, project environments, and the process and stakeholder models that form the fundamentals of our approach.
2.1. Motivating Example
Figure 1 depicts the improvement request GROOVY-5223 (https://issues.apache.org/jira/browse/GROOVY-5223), retrieved from the Groovy project’s issue tracker, JIRA (JIRA, [n. d.]). The request consists of a unique issue ID, a short summary, a longer textual description, time stamps for issue creation and resolution, the issue’s current status, and information about its resolution. This particular improvement requests an enhancement to an existing feature concerning class loading at byte code level. Figure 2 shows a bug report GROOVY-5082 (https://issues.apache.org/jira/browse/GROOVY-5082) for the same project. It includes the same fields as the improvement, except that the type is specified as a bug. In this case, the bug describes a problem with byte code generation for the Groovy language. Finally, Figure 3 shows an example of a commit (http://goo.gl/pBy6Nw) submitted to the Git (Git SCM, [n. d.]) version control system. A commit (change set) includes a unique commit hash value, a message describing its purpose, the time stamp when it was submitted, and finally a list of files modified by the change set.
The common way to establish a trace link between a commit and an issue is to place the unique issue ID (GROOVY-5082 in this example) at the beginning of the commit message. However, a close examination of the commit message in this example shows that the committer made a subtle mistake and misspelled the issue key for the bug that was being fixed (omitting an O). As a result, traditional trace link construction techniques that rely upon matching the key to an issue will fail to create a trace link.
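To make the key-matching idea concrete, the following minimal sketch shows how a Jira-style issue key could be extracted from a commit message and validated against known issues. The function name and regular expression are illustrative assumptions, not the authors' implementation; the misspelled key from the example still matches the generic pattern but resolves to no known issue, so no link is created.

```python
import re

# Jira-style issue keys look like "GROOVY-5082": an uppercase project
# prefix, a dash, and a number (pattern is an illustrative assumption).
ISSUE_KEY = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b")

def extract_issue_keys(message, known_issues):
    """Return the keys found in `message` that identify known issues."""
    candidates = ISSUE_KEY.findall(message)
    return [key for key in candidates if key in known_issues]
```

A misspelled key such as "GROVY-5082" is syntactically a valid key but is filtered out because it does not match any issue in the tracker, illustrating why purely lexical matching misses such links.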
However, even without a valid issue key, there are numerous clues to suggest that the commit should be associated with the reported bug. First, the bug description exhibits textual similarity to the commit message as well as to the text in the changed file AsmClassGenerator.java. Second, the commit was submitted on the same date that the issue was resolved, and finally, the person (obfuscated for privacy reasons) who submitted the commit was also responsible (i. e. the assignee) for resolving the issue. Taken together, these observations provide some degree of evidence that the commit and bug should be linked. This example illuminates the thinking behind our proposed solution. We build a classifier that leverages all of this information, plus additional attributes, to learn which issues should be tagged to each specific commit.
2.2. Software Artifacts and their Relations
While version control systems and issue trackers have several types of artifacts, our approach leverages three of them to construct missing commit links. These are issues, commits (i.e., change sets), and source code files.
Issues: Our model uses issues collected from the Jira issue tracking system. While there are several types of issues, we focus on improvements and bugs, which are the most commonly occurring ones. An improvement represents an enhancement to an existing feature in the software, while a bug describes a problem which impairs or prevents its correct functionality. In the remainder of the paper, the term issue is used in reference to both improvements and bugs. Independent of their actual type, all issues share the following properties: a unique issue ID, a summary providing a brief one-line synopsis of the issue, and a more extensive explanation provided in the description. Further, every issue has a temporal life cycle – it is created at a given point in time and later resolved, and may be assigned to an author responsible for its resolution.
Commit (Change Set): In Git version control, changes are organized as atomic commits. A commit bundles together all modified files and is uniquely identified by a hash value. It includes properties concerning the person who made the change, a time stamp (committed date), and a commit message stating the purpose of the change.
Source Code: Multiple types of files are associated with a software project including source code files, documentation, examples, and tests. In this paper we focus on source code files which are explicitly linked to commit messages. The source code files provide support for our primary goal of establishing links between commit messages and issues.
Relations: We are also interested in relations between the three fundamental types of artifacts in our model. A commit atomically bundles one or more source files together. This containment relation between commit and source code artifacts is a natural result of submitting the change set to the version control system. Further, as previously explained, trace links are explicitly created between issues and commits when a developer tags a commit with a valid issue ID. We denote an issue as linked, if there is at least one trace link from the issue to a commit. Issues without any links are termed non linked. Figure 4 depicts the three artifacts as well as their structural interactions.
We denote by I the set of issues (bugs and improvements), by C the set of commits, and by S the set of source code files in a project. The function link(c, i) returns true if an explicit link exists between a commit c ∈ C and an issue i ∈ I, and false otherwise. The function files(c) ⊆ S calculates the set of source code files modified by a given commit c. A source code file may be part of multiple commits.
2.3. Studied Projects
For our study, we selected six projects from diverse domains that utilized both Git and Jira. They included: build automation (Maven (Ma)), databases (Derby (De), Infinispan (In)), languages (Groovy (Gr), Pig (Pi)), and a rule engine (Drools (Do)), primarily selected because each of these projects has existed for several years, has a non-trivial number of commits and issues, and largely followed the practice of tagging commits with issue IDs. We analyzed each of the projects to gain an understanding of the numbers of links that existed between commit messages and issues. Further, we analyzed the number of issues that were linked to exactly one commit (1:1), two or more disjoint commits (1:N), or had no links. Results are reported in Table 1. For example, of the 2,638 bug-related issues in the Derby project, 1,093 were linked to only one commit, 273 were linked to multiple commits, and 1,272 had no associated commits. Across all of the projects approximately 43.3% of improvements and 42.4% of bugs have no commits associated with them.
Table 2 depicts a similar analysis from the perspective of the commits. It reports the number of commits with links to issues for the selected projects. Again, we analyzed the distribution of 1:1 links, 1:N links and non linked commits. In the Derby project, of the 3,735 commits, 1,657 linked to only one bug, 1,350 to only one improvement, 175 linked to multiple bugs or improvements, and 553 commits had no links. However, across all of the projects approximately 48% of the commits were not linked to any issue. Furthermore, there was significant variance across the six projects with only 15% of commits in Derby having no links compared to approximately 76% of unlinked commits in Maven. Clearly, different practices exist across different projects, leading to huge disparities in the extent to which issue tags are added to commit messages.
One of the primary goals of our work is to establish at least one link for each commit. As the majority of commits link to a single issue, we only attempt to generate links for currently unlinked commits. For commits without links there are two viable cases – first, that an appropriate issue exists and a link can be generated, and second, that no appropriate issue exists for the commit.
2.4. Process Model
As previously explained, our approach leverages clues from the development process to aid in the generation of links. First we observe that the software development process is time dependent: bugs and improvements are constantly created and resolved, and commits are submitted to the version control system. Figure 5 exemplifies this scenario.
It contains six issue artifacts, nine commits, and six source code file artifacts. The issue artifacts and commits are ordered along a time line, and several of the issues and commits in the example are linked. The figure also shows the relation between issues and commits according to the time line. We define the functions created(i) and resolved(i), which return the points in time when issue i was created and resolved, respectively. During this interval, the issue is considered to be unfinished and source code modifications are required in order to implement the improvement or fix the bug. In our study we focus on issues that are resolved. The function committed(c) returns the time stamp at which a commit c was submitted to the version control system.
Temporal relations. Considering a non linked commit c, the temporal structure imposes several constraints on the possible link candidates i. The following three cases exist.
committed(c) < created(i): Due to causality, the commit is not considered to be a link candidate for the issue, since it was submitted before the issue was created.
created(i) ≤ committed(c) ≤ resolved(i): This situation depicts the usual development work flow. After issue creation, the developers modify the source code and submit commits in order to resolve the issue. These commits are traced to the issue. Eventually, the issue is resolved and, in this case, no further commits are made for the issue.
committed(c) > resolved(i): Intuitively, a trace link from c to i is not considered in this situation, since i was already resolved before the commit occurred. However, this situation is not uncommon, as Table 3 shows. One obvious reason might be that a developer forgot to submit the commit before resolving the issue. Another might simply be clock differences between the unconnected, decentralized systems used by Jira and Git, which prevent strict time comparisons.
Table 3. Properties of linked commits. (1) The distribution of commits linked to issues after issue resolution, along with the median time. (2) The average file overlap of consecutive commits linked to the same issue.

Project                                     De     Dr     Gr     In     Ma     Pi
(1) Commits linked to already resolved issues
    Number                                 136    207  2,648    244    847    100
    Median time after resolved            150h    60h     5h    19h     5h    60h
(2) Avg. file commit overlap              0.35   0.35   0.71   0.33   0.40   0.45
In project Derby, there is sometimes a large discrepancy between the time at which an issue is resolved and the last commit that traces to it. For example, the improvement DERBY-6516 (https://issues.apache.org/jira/browse/DERBY-6516) was resolved as fixed on 20/Mar/14; yet, on 4/Apr/14 a commit (78227e4, http://goo.gl/j3WYd6) was submitted and linked to this improvement. However, this scenario is quite rare, affecting only 136 commits. Interestingly, in both the Groovy and Maven projects, the median time difference for late commits is much lower (only 5 hours), but affects a huge number of commits. For example, in the Maven project, we observed that between 2005 and 2015 there was a constant offset between issue resolution and corresponding commit of either five or six hours, as illustrated by MNG-221 (https://issues.apache.org/jira/browse/MNG-221, from 2005), MNG-2376 (from 2008), and MNG-5245 (from 2012).
These temporal constraints limit the potential pairs of candidate links between non linked commits and non linked improvements and bugs.
Structural relations. Table 2 reveals (in row 1:N) that multiple commits are often required in order to solve an issue. Ideally, all of these commits are traced to the respective issue. However, often only one commit in this series is explicitly linked. In (Schermann et al., 2015) the other commits in this series are termed phantoms. All commits in the series may share commonalities: in addition to their succession in time, the commits may modify a similar set of source code files, since they are related to the same issue. We therefore define an overlap function on the file sets of two commits, relating the number of files they share to the total number of files they touch. As shown in Table 3, the average overlap of consecutive commits linked to the same issue varies among the projects. For example, in Derby the average overlap is 0.35, meaning that, on average, one out of three files is the same for commits in a series. The highest value, 0.71, is achieved in Groovy, where a few files are changed multiple times to implement an improvement or bug fix. Three commits (51d4fee, 3d20737, and 974c945; http://goo.gl/4YGjhe, http://goo.gl/AZcwmK, http://goo.gl/2TYhsz) were submitted between 4/Jan/2009 and 6/Jan/2009, all modifying one and the same source code file DefaultGroovyMethods.java and all linked to improvement GROOVY-3252 (https://issues.apache.org/jira/browse/GROOVY-3252). This results in an overlap of 1.0 for each commit pair in the series.
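The overlap function can be sketched as a Jaccard-style measure over the file sets of two commits. The exact definition used by the authors is not spelled out here, so this intersection-over-union form is an assumption consistent with the examples in the text (identical file sets yield 1.0, one shared file out of three yields roughly 0.33):

```python
def file_overlap(files_a, files_b):
    """Overlap of the file sets of two commits, in [0.0, 1.0].

    Jaccard-style sketch: shared files divided by all touched files.
    """
    files_a, files_b = set(files_a), set(files_b)
    if not files_a or not files_b:
        return 0.0
    return len(files_a & files_b) / len(files_a | files_b)
```

For instance, two commits that both modify only DefaultGroovyMethods.java overlap fully (1.0), matching the GROOVY-3252 series described above.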
Based on temporal closeness and file overlap, there are indications that a non linked commit and a neighboring linked commit may belong to the same series and thus could be traced to the same issue. The situation may also occur forward in time: a commit submitted shortly after a linked commit, and with overlapping files, may belong to the same series and thus should be traced to the same bug.
2.5. Stakeholder Model
Issue and commit artifacts both carry information about the author. The assignee of an issue might also be the person who contributes commits to solve the issue. In the studied scenario, there is no technical connection between the issue tracker Jira and the version control system Git. Thus we cannot rely on an available stakeholder model. We therefore applied the following approach to identify individual developers in both systems. In each system, a developer is represented by a name and a login (nickname or email). In the first step, we separately collected all developers from the two systems and built two groups. In this step, we merged names if they used the same login and were therefore aliases for the same person. In the second step, we heuristically merged the two resulting developer lists by comparing names, in order to identify the same person in both systems. In order to fully protect user privacy and to comply with GitHub privacy requirements, a unique number (user ID) was assigned to every developer; a function returns this user ID for a given commit or issue.
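The two-step identity merge can be sketched as follows. The function names, the (name, login) representation, and the case-insensitive name comparison are illustrative assumptions; the paper describes the heuristic only at this level of detail.

```python
def merge_by_login(accounts):
    """Step 1: group aliases within one system.

    accounts: list of (name, login) pairs -> {login: set of alias names}.
    """
    groups = {}
    for name, login in accounts:
        groups.setdefault(login, set()).add(name)
    return groups

def match_across_systems(jira_groups, git_groups):
    """Step 2: pair Jira and Git identities whose normalized names intersect."""
    matches = []
    for j_login, j_names in jira_groups.items():
        j_norm = {n.lower() for n in j_names}
        for g_login, g_names in git_groups.items():
            if j_norm & {n.lower() for n in g_names}:
                matches.append((j_login, g_login))
    return matches
```

Each matched pair would then be assigned a single numeric user ID, so that neither names nor logins appear in the published data set.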
3. The Link Classifier
Our goal was to create a classifier that could identify issues associated with a commit. The classifier was therefore trained to predict whether any issue-commit pair should be linked or not.
3.1. Attributes of the Commit-Issue Relation
Based on the artifact model introduced in the previous section, we identified 18 attributes per instance. These attributes fall into two categories: process-related information, and textual similarity between artifacts computed using information retrieval techniques.
We consider the following 16 process-related factors to model the relationship between commits, source code files, and issues. These factors capture stakeholder-related, temporal, and structural characteristics of the candidate pair (c, i) with c ∈ C and i ∈ I:
Stakeholder-related information: We capture the identities of the committer and of the issue's assignee, and additionally mark with a binary attribute whether the two are identical.
Temporal relations between issue and commit: Based on the temporal properties of issue and commit, we calculated the time between issue creation and the commit, and between the commit and issue resolution. Additionally, we capture as a binary attribute whether created(i) ≤ committed(c) ≤ resolved(i), i.e., whether c was committed during the active development time of i. Furthermore, we capture with a binary attribute whether the commit occurred close to the issue's resolution; the closeness threshold was derived from observing that late commits occur on average within 5 to 150 hours of the issue resolution for the studied projects (see Table 3).
Closest previous linked commit: We capture the set of commits previous to c that are linked to i. If this set is non-empty, the commit with the largest commit time stamp is taken and used to calculate attributes relating it to the candidate pair, including its temporal distance to c and its file overlap with c.
Closest subsequent linked commit: Analogous to the closest previous linked commit, we capture the set of commits subsequent to c that are linked to i, and select the one with the minimal commit time to calculate the corresponding temporal distance and file overlap attributes.
Number of issues and existing links: We calculate the set of issues existing at the time of the commit and capture its cardinality. From this set, we derive the subset of non-resolved issues assigned to the assignee of i at that instant in time and capture its size. Finally, we capture the number of links established to i before commit c.
Textual Similarity Attributes
We leveraged information retrieval methods to compute textual and semantic associations between commit messages, source code files, and issues. We explored three primary techniques for computing textual similarity . These were the Vector Space Model (VSM), VSM with N-Gram enhancements (VSM-nGram), and Latent Semantic Indexing (LSI) (Antoniol et al., 2002; Eaddy et al., 2008; De Lucia et al., 2004).
In the VSM model, each document, i.e., commit message, issue description, and source code file, is treated as an unstructured bag of terms. Following common information retrieval techniques, documents are pre-processed to remove stop words, to stem words to their morphological roots, and to split camel-case and snake-case words (e.g., optionsParser vs. options_parser) into their constituent parts. Each document d is then represented as a vector d = (w1, …, wn), where wi represents the term weight associated with term i for document d. Each term is assigned a weight using the standard tf-idf weighting scheme (Huffman Hayes et al., 2006). The cosine similarity between a pair of vectors is then computed as follows in order to estimate the similarity between two documents d1 and d2:

sim(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖)    (1)
The N-Gram enhancement to VSM utilizes n-gram models (Witten et al., 2016; Cavnar, 1995). An n-gram is a contiguous sequence of words in a document. Each document is again represented as a vector, but in this case the vector is comprised of both the words and the n-grams the document contains. The documents are preprocessed in the same way as in the basic VSM. Based on initial experimentation, we set n from 2 to 4, to include 2-gram, 3-gram, and 4-gram sequences in the vector representations. The similarity between vectors was again calculated using the cosine measure (equation (1)) with the tf-idf weighting scheme described above.
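The VSM-nGram pipeline described above can be sketched compactly in plain Python. This is a simplified illustration, not the authors' implementation: it performs camel-case splitting and word n-gram extraction, weights terms with a basic tf-idf scheme, and compares documents with cosine similarity; stemming and stop-word removal are omitted for brevity.

```python
import math
import re
from collections import Counter

def tokenize(text, n_max=4):
    """Lowercase word tokens plus word 2- to n_max-grams (camel-case split)."""
    split = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)   # optionsParser -> options Parser
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", split)]
    grams = list(words)
    for n in range(2, n_max + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams

def tfidf_vectors(docs):
    """Represent each document as a sparse {term: tf-idf weight} mapping."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    df = Counter(t for doc in tokenized for t in doc)   # document frequency
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in doc.items()} for doc in tokenized]

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors (equation (1))."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

For example, a commit message and an issue description that share terms such as "bytecode generation" score higher against each other than against an unrelated document.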
We conducted an initial comparative study of Latent Semantic Indexing (LSI) (Antoniol et al., 2002; Eaddy et al., 2008; De Lucia et al., 2004), VSM, and VSM-nGram. Based on an initial comparison of the results we selected the VSM-nGram approach for computing textual similarity scores. This was because we observed that VSM-nGram outperformed VSM on our datasets, and ran much faster than LSI. In fact, the computation time of LSI on our datasets was prohibitively slow with runtimes of up to 40 hours in some cases, and so we rejected it as impractical. Furthermore, several previous studies have shown that VSM tends to either outperform LSI on software engineering datasets or perform in equivalent ways (Huffman Hayes et al., 2006; Lucia et al., 2012; Guo et al., 2016). A detailed comparison of trace retrieval techniques within our classifier is outside the scope of this research. Therefore, based on our initial analysis, we chose VSM-nGram to compute the following similarity attributes:
Textual similarity of a commit and an issue: The similarity between the commit message and the textual content of the issue (for both improvements and bugs) is captured as an attribute.
Textual similarity of committed source files and an issue: For each commit-issue pair, the textual similarity between the content of the most similar committed source code file and the textual content of the issue is captured as an attribute.
3.2. Studied Attribute Sets
We studied the impact of the presented attributes in four subsets.
Process – This set solely contains the process-related attributes. It studies the impact of all process-related attributes without considering textual similarity.
Similarity – This set consists of the textual similarity attributes. It solely considers textual similarity between commit and issue, given the constraint that the issue existed at the time of the commit.
All – This set contains all process, similarity, and stakeholder-related attributes.
Auto – This set addresses potential correlations and dependencies among attributes. It contains an automatically selected subset derived by considering the individual predictive ability of each attribute along with the degree of redundancy between them. We implemented the redundant attribute removal based on Weka’s inbuilt attribute auto-selection feature (Hall, 1998).
3.3. Dataset Profiles and Splits
We aim to classify links between commits and improvements and between commits and bugs. We therefore construct two distinct profiles for each project, each built from process and similarity attributes per commit-issue pair.
Each profile consists of a distinct training set and a testing set. We applied the following procedure per project to create instances of candidate commit–issue pairs for both sets.
Candidate pairs are restricted by a filter function that limits the number of candidate commit/issue pairs according to causality: a pair (c, i) is considered only if created(i) ≤ committed(c) ≤ resolved(i) + Δ. The first condition ensures that a link candidate is never considered between a commit and an issue that had not yet been created at the time of the commit. The second condition ensures that a commit is not unboundedly considered as a candidate for issues resolved in the past. Based on an analysis of commits onto closed issues (see Table 3), we found that the median commit time after an issue was marked resolved lay between 5 and 150 hours for the studied projects, and we decided to choose Δ as 30 hours. The candidate sets for training and testing were then created by splitting the filtered pairs at a point in time.
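The causality filter can be expressed as a small predicate. Times are modeled here simply as numbers of hours on a common clock, which is an assumption made for illustration; the 30-hour slack is the value chosen in the text.

```python
DELTA_HOURS = 30  # slack for commits submitted shortly after issue resolution

def is_candidate(committed, created, resolved):
    """A commit-issue pair is a candidate iff the issue existed at commit
    time and the commit is at most DELTA_HOURS after the issue's resolution."""
    return created <= committed <= resolved + DELTA_HOURS
```

A commit submitted before the issue was created, or more than 30 hours after its resolution, is never paired with that issue, which keeps the candidate set bounded.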
A parameter defines the point in time which splits the training and test sets. We chose an 80%–20% split and calculated this point as follows. First, we ordered the improvements in the respective project according to their creation date in ascending order. We then selected the improvement which divides this sequence into 80% and 20% of all improvements, and set the split point to its resolution time.
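The split-point computation described above can be sketched as follows, representing each improvement as a (created, resolved) pair of timestamps; the function name and the exact index arithmetic are assumptions.

```python
def split_point(improvements):
    """Return the resolution time of the improvement at the 80% position
    of the sequence ordered by creation date (sketch of the procedure)."""
    ordered = sorted(improvements, key=lambda pair: pair[0])  # by creation date
    index = max(0, int(0.8 * len(ordered)) - 1)               # last of the first 80%
    return ordered[index][1]                                  # its resolution time
```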
Each commit-issue candidate in the two profiles forms an instance to train or test the classifier, where the test data (i.e., 20% in every project) is distinct from the training data. For each instance, we calculate the 18 attributes that characterize the relation between commit and issue. In addition, each instance is annotated with the known class (i.e., linked or non-linked) as extracted from the projects’ data. Linked means that the developer had created an explicit tag from the commit to the issue, while non-linked means that no such tag exists.
3.4. Classifier Training
We investigated three different supervised learning classifiers for categorizing commit-issue pairs as linked or non-linked: Naïve Bayes, J48 decision tree, and Random Forest. We included Naïve Bayes because, even though its assumption of attribute independence rarely holds in software project data, the algorithm has been demonstrated to be effective for solving similar prediction problems (Kim et al., 2011; D’Ambros et al., 2012; Guo et al., 2004; Falessi et al., to appear). We utilized Weka’s J48 decision tree with default pruning settings because of its previously reported effectiveness in other software engineering studies (Guo et al., 2004). Finally, we included the Random Forest classifier because it has been shown to be highly accurate and robust against noise (Breiman, 2001), although it can be expensive to run on large data sets.
The profiles we created were severely unbalanced, containing many more instances of non-links than links. Training against such unbalanced sets makes it likely that the classifier will favor placing instances into the majority class (i.e., in this case, classifying all pairs as non-links). We performed all experiments using Weka (Hall et al., 2009) and used the inbuilt sub-sampling feature to create balanced data sets: given a fixed number of explicit links, Weka randomly selects the same number of non-links. We trained each classifier in turn using the balanced training sets for each project and then evaluated the classifier against the respective unbalanced testing sets. To mitigate the random effects of sub-sampling, we repeated the training and testing 10 times and averaged the achieved results. We did not follow an ordinary 10-fold cross validation approach because several of the studied attributes (e.g., those relating to the closest previous and subsequent linked commits) reflect temporal sequences in the development, making it necessary to ensure that the temporal sequencing between training and test data was preserved. For each technique, the classifier returned a category (i.e., linked or non-linked) and also a score, which we used to rank recommended links in order of likelihood.
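The balancing step mirrors random under-sampling of the majority class. The following sketch reproduces the idea outside of Weka; the seeded generator stands in for repeating the sampling across the 10 runs and is an illustrative choice, not part of the original setup.

```python
import random

def balanced_training_set(linked, non_linked, seed=0):
    """Randomly under-sample the majority (non-linked) class so both
    classes contribute the same number of instances (simplified sketch
    of Weka's sub-sampling)."""
    rng = random.Random(seed)                      # vary seed per repetition
    sampled = rng.sample(non_linked, len(linked))  # |non-links| == |links|
    return linked + sampled
```

Repeating this with ten different seeds and averaging the results reduces the variance introduced by the random selection of non-links.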
4. Data collection
To prepare the data for training and testing the classifier, we performed a two step data collection process for each of the six projects.
Step 1: Analyzing the project management and issue tracker system. We implemented a collector to retrieve artifacts (i.e., improvements and bugs). All six projects use the Jira project management tool, which offers a web-service interface. Our collector downloaded and parsed all artifacts. Using the artifact type, we filtered the artifacts to retrieve only bugs and improvements, applying the following mapping from Jira types to our model: bug → bug; improvement, enhancement → improvement. In both cases, the artifacts represented finished work, i.e., their status was "Resolved" or "Closed" and the resolution "Fixed" or "Done".
Step 2: Analyzing the Source Control Management (SCM) system. A second collector was implemented to download all source code changes and commit messages from each SCM repository (i.e., Git). We parsed the commit messages and applied the heuristic described in (Bachmann and Bernstein, 2009) to retrieve existing trace links from commits to bug reports and improvements, based on searching and matching the issue keys in commit messages. Given the goals of our traceability experiment, we excluded non-source code files related to documentation and build automation based on their file name extensions. Additionally, we analyzed file paths in order to exclude source code files implementing test cases. In the standard Maven directory layout (https://goo.gl/D8uYaD), used by all six of our projects, source files are placed in sub-directories of src/main and test sources in sub-directories of src/test/java.
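The file filtering described in Step 2 can be sketched as a simple path-based predicate. The particular extension list and the restriction to .java files are assumptions for illustration; the paper only states that documentation, build files, and test sources were excluded.

```python
# Extensions treated as documentation / build automation (assumed list).
EXCLUDED_EXTENSIONS = (".md", ".txt", ".xml", ".properties", ".html")

def keep_for_analysis(path):
    """Keep Java sources under src/main; drop tests and non-source files."""
    if path.endswith(EXCLUDED_EXTENSIONS):
        return False               # documentation / build automation
    if "/src/test/" in "/" + path:
        return False               # test code (standard Maven layout)
    return path.endswith(".java")  # production source code
```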
The results of these two steps were stored in one archive per project, which is publicly available (Rath et al., 2018). Data were collected from each project until May 31st, 2017.
5. Reconstructing Known Links
We performed a two-phase evaluation. In the first phase we address RQ1 and RQ2 by exploring two different usage scenarios. The first uses the classifier as a recommender system to suggest a list of the most likely issues at the time a commit is submitted. Ideally this functionality would be integrated into the version control system and activated when the user presses the commit button. In this scenario, high recall is imperative, so that the relevant issue (if it exists) is included in the displayed list. The second experiment evaluates the case in which the classifier is used to automatically augment an existing set of trace links for a project. In this scenario, high precision is essential because links that are automatically added must exhibit high accuracy. In the experiments for both scenarios, we leveraged the existing links created by the project developers from explicit commit-issue tags as an "answer set" to train and evaluate our classifiers. Both experiments therefore evaluate whether the classifier would have been able to recommend or create a known link if the committer had forgotten to create its tag manually. We trained the three classifiers on the four attribute sets as described in Section 3.2 and Section 3.4.
Results were evaluated using commonly adopted traceability metrics. Recall measures the fraction of relevant links that are retrieved, while precision measures the fraction of retrieved links that are relevant. Finally, the F-measure is the harmonic mean of recall and precision (Shin et al., 2015; Huffman Hayes et al., 2006; Lohar et al., 2013a; Antoniol et al., 2002). We utilize two variants of the F-measure, namely F2, which is weighted to favor recall, and F0.5, which is weighted to favor precision.
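Both variants are instances of the general F-beta measure, which can be computed directly from precision and recall; a minimal sketch:

```python
def f_beta(precision, recall, beta):
    """General F-measure: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# F2 weights recall higher; F0.5 weights precision higher.
f2 = lambda p, r: f_beta(p, r, 2.0)
f05 = lambda p, r: f_beta(p, r, 0.5)
```

With beta = 1 the formula reduces to the ordinary harmonic mean of precision and recall.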
Scenario 1: Recommending Issues to Assist Commits: The goal of this scenario is to create a short list of at most three recommended links to assist developers in tagging their new commits with issue IDs. Thus, we truncate the retrieved lists after the third rank and evaluate classifier performance in terms of precision, recall, and F2-measure at this point. The F2-measure is selected because the objective of this scenario is to achieve high recall. The results for the best performing classifier, Random Forest, are shown in Table 4. The attribute set All achieves an average recall of 96% and an average precision of 33%, which in combination marks the best performance of the studied feature sets. An application of the Mann-Whitney U test (Mann and Whitney, 1947) shows that the All attribute set significantly outperforms the other attribute sets in terms of F2 score. The other two classifiers also performed best when using the All feature set; however, their scores were significantly lower than those of Random Forest (for both J48 and Naïve Bayes). Generally, the values show that the Random Forest classifier is able to place the one true link among the three recommended links. The attribute set Similarity exhibits the lowest F2 measure. The feature set Process performs considerably well. This is notable because it does not require resource-intensive IR techniques to extract the necessary features. However, adding the similarity features to the model results in overall better performance (see the All attribute set). An exception within the results is the Derby project, which underperforms on all attribute sets. The low recall values (e.g., for Process) indicate that the correct link is missing from the ranked list for one out of two commits.
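The truncation at rank three and the per-commit recall check amount to a standard recall@k evaluation; a small sketch (names illustrative):

```python
def recommend_top_k(scored_issues, k=3):
    """Rank candidate issues by classifier score and keep the top k."""
    ranked = sorted(scored_issues, key=lambda pair: pair[1], reverse=True)
    return [issue for issue, _ in ranked[:k]]

def recall_at_k(recommendations, true_issue):
    """1 if the true issue appears in the truncated list, else 0."""
    return int(true_issue in recommendations)
```

Averaging `recall_at_k` over all commits with a known link yields the recall figures reported per attribute set.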
Scenario 2: Fully Automated Augmentation of Trace Links between Commits and Issues: The classifier performance for the second scenario is evaluated in terms of precision, recall, and F0.5-measure, because the objective of this scenario is to achieve high precision. Results are reported in Table 5 for the Random Forest classifier, which again performed best. A fully automated environment requires high precision; thus we defined a project-independent cut-off point on the classifier score that achieves a precision above 89% across all projects when using the All attribute set. The other classifiers, J48 and Naïve Bayes, were unable to achieve the required precision. For Random Forest, recall drops to 50% on average as a consequence of the required precision, and thus only one out of two known links would be re-created. In project Derby, the recall for All is again the lowest, with similar values for the Process and Auto sets. However, the attribute set containing only textual similarity attributes performs best on this project, resulting in the highest F0.5 measure, which favors precision over recall. As in the previous evaluation scenario, structural attributes do not perform well on this project, which is further discussed in the next section.
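One simple way to derive such a precision-driven cut-off is to sort candidate links by classifier score and keep the lowest threshold whose accepted set still meets the target precision. This is a sketch of the idea, not the paper's exact procedure; all names are illustrative:

```python
def cut_off_for_precision(scores, labels, target_precision):
    """Lowest score threshold whose accepted links reach the target precision.

    scores: classifier scores per candidate pair; labels: 1 for true link, 0 otherwise.
    Returns None if no threshold achieves the target.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = None
    tp = fp = 0
    for i in order:                      # sweep thresholds from strictest to loosest
        tp += labels[i]
        fp += 1 - labels[i]
        if tp / (tp + fp) >= target_precision:
            best = scores[i]             # accepting down to this score meets the target
    return best
```

Choosing the lowest qualifying threshold maximizes recall subject to the precision constraint, which matches the trade-off described above.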
6. Constructing Unknown Links
The previous experiment was designed to reconstruct known links. However, the real value of our classifier lies in recommending tags for commits with no existing links. We have strong, albeit not perfect, confidence that the explicitly linked pairs of commits and issues are correctly labeled; the non-links, however, constitute a combination of true negatives (i.e., correctly labeled non-links) and false negatives (i.e., incorrectly labeled non-links). Of these, the false negatives represent the missing links that we now target. These missing links result from cases in which a developer failed to associate a commit with an issue or created an incomplete set of tags. Previous studies have reported the difficulty of correctly classifying entities not represented in the original training set (Park et al., 2012); we therefore need to evaluate the ability of the classifier to detect previously missing links.
Since no answer set for the non-linked commits is available, we needed to perform a manual inspection of the proposed links. As a sanity check, we first evaluated whether it would be plausible to classify links on these unknown parts. In all six projects, commits with links are typically related to only one issue or to a very small number of them (see Table 2). We therefore counted the average number of issues classified as links for each of the commits without any explicit link (see Table 6). Tables 1 and 2 characterize the current linking situation in the studied projects; from the ratios of linked commits to linked issues, we derived the number of links per commit to expect. For example, in project Infinispan, the ratio of commits linked to bugs and bugs linked to commits yields the expected number of commits per bug, and analogously for improvements. For both bugs and improvements, the classifier proposes fewer links per non-linked commit than these ratios suggest, which means that our approach is conservative. For project Derby, our approach underperformed: while the existing ratio of linked commits per bug is comparable to that of the other projects, the classifier's suggestions deviate considerably from it. This may stem from the imbalance between non-linked commits and non-linked issues in the project, which contains far more non-linked issues than non-linked commits. However, the same imbalance of non-linked issues and commits also exists in project Pig, where the classifier is unaffected.
This analysis shows that, except for Derby, the classified number of links is plausible. However, it is not clear whether these links are correctly classified. To assess this, we manually evaluated the correctness of a random selection of new links proposed by the classifier, using the following systematic process for each project. Steps 1-4 are independently performed by one researcher (the data preparer), while steps 6-7 are performed collaboratively by four additional researchers (referred to as evaluators). No communication was allowed between the researcher creating the dataset and the four evaluators during this process.
Data Set Construction for Missing Links
Twenty commits without any explicitly tagged issues from the original data set for a given project were randomly selected and randomly divided into two groups, A and B: 70% were placed into group A and 30% into group B.
For commits in group A, the most highly ranked issue ID was selected as the candidate link, while for commits in group B, an issue tag that was not recommended by the classifier was selected. Group B was added to mitigate evaluation bias and to ensure a mix of links and non-links in the evaluation set.
A randomly ordered list of the commit-issue pairs selected in the previous step was generated.
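The construction steps above could be sketched as follows; the 70/30 split and the shuffling mirror the description, while the ranking callback and all names are illustrative:

```python
import random

def build_evaluation_sample(unlinked_commits, classifier_ranking, sample_size=20,
                            group_a_share=0.7, seed=42):
    """Randomly draw commits and pair each with a recommended issue (group A)
    or a deliberately non-recommended issue (group B)."""
    rng = random.Random(seed)
    commits = rng.sample(unlinked_commits, sample_size)
    cut = round(sample_size * group_a_share)
    pairs = []
    for i, commit in enumerate(commits):
        ranked = classifier_ranking(commit)           # candidate issues, best first
        issue = ranked[0] if i < cut else ranked[-1]  # top pick vs. non-recommended
        pairs.append((commit, issue))
    rng.shuffle(pairs)  # randomize order so evaluators cannot infer the group
    return pairs
```

The final shuffle is what blinds the evaluators to whether a pair came from group A or B.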
Human Evaluation of Proposed Links
Four human evaluators worked together to classify the first five commit-issue pairs from one randomly selected project. They performed this task without any knowledge of whether the link was recommended by the classifier or not. The evaluators then worked individually to classify the next five commit-issue pairs in the list.
The Fleiss kappa inter-rater agreement was computed. Fleiss's kappa assesses the degree to which multiple raters agree beyond chance when classifying items into a fixed number of categories (Fleiss, 1971). A kappa value of 1 means that all raters are in complete agreement, while values above 0.4 are commonly interpreted as moderate agreement. The evaluators discussed the results for 20 commit-issue pairs with the aim of achieving consensus in classifying each pair as having a link or not. The Fleiss kappa value for this evaluation was approximately 0.5617, indicating sufficient reliability of the evaluators in agreeing on the link status between a commit and an issue.
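Fleiss's kappa can be computed directly from an item-by-category count matrix; a minimal sketch, assuming every pair was rated by the same number of evaluators:

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters assigning item i to category j;
    every item must be rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Observed agreement: mean pairwise agreement per item.
    p_bar = sum((sum(c * c for c in row) - n_raters) /
                (n_raters * (n_raters - 1)) for row in ratings) / n_items
    # Chance agreement from the marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

For the link/non-link case here there are two categories, so each row simply counts how many evaluators saw a link versus no link for that commit-issue pair.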
As satisfactory inter-rater agreement was achieved, the remaining commit-issue pairs were split amongst the evaluators and all pairs were evaluated. The decisions made by the evaluators constitute the "answer set" of previously unknown links against which the classifier is evaluated.
Due to the labor-intensive nature of this analysis, we evaluated only three projects: Derby, Drools, and Maven. Recall and precision were computed by comparing the results returned by the classifier against the manually created "answer set". The results obtained for forty commit-issue pairs per project are summarized in Table 7.
Results indicate that all projects except one returned a recall of 100%. The exception was Maven, where a recall of 100% was achieved for commit-to-bug links, but only 50% for commit-to-improvement links. In this case, there were only two true links, and one of them was missed: the classifier found the pair to be unconnected while the evaluators determined that a link did exist. The precision returned for each of the three projects, for both bugs and improvements, was lower than the precision returned in the earlier experiments with explicitly defined links. For example, in the earlier experiments Derby's precision was 0.30 for bugs and 0.32 for improvements; these scores dropped to 0.25 and 0.11 respectively when the classifier was used to generate links for commits with no previously known issue tags. Similar trends were observed for Drools. Precision dropped considerably for Maven, however, returning 0.11 for bugs but only 0.04 for improvements. A potential explanation for the poor precision in the Maven project is that a majority of the commits represent code refactorings that in many cases were not associated with any issue at all, resulting in several false positive links. This was also the case in other projects, where several commits were not directly associated with any particular issue but addressed a trivial task such as correcting a typo or adding a comment to the Javadoc. These types of commits negatively impact overall precision.
7. Related Work
The most closely related work falls under the two areas of feature location and tracing bug reports to code.
Feature location attempts to identify sections of source code related to a specific requirement or issue. Several authors have looked at static approaches based on information retrieval techniques. For example, Antoniol et al. (Antoniol et al., 2002) used a probabilistic approach to retrieve trace links between code and documentation. Hayes et al. used the Vector Space Model (VSM) algorithm in conjunction with a thesaurus to establish trace links (Huffman Hayes et al., 2006). Other studies applied Latent Semantic Indexing (De Lucia et al., 2004; Rempel et al., 2013), Latent Dirichlet Allocation (LDA) (Dekhtyar et al., 2007; Asuncion et al., 2010), or recurrent neural networks (Guo et al., 2017) to integrate semantics or the context in which various terms are used. Other researchers have combined the results of individual algorithms (Lohar et al., 2013b; Dekhtyar et al., 2007; Gethers et al., 2011), applied AI swarm techniques (Sultanov et al., 2011), and combined heuristic rules with trace retrieval techniques (Spanoudakis et al., 2004; Guo et al., 2013; Cleland-Huang et al., 2012). Our approach leverages information retrieval to compute similarity between various types of issues, commit messages, and code. We investigated the use of LSI but rejected it for pragmatic reasons: it had a long execution time, and prior studies have not shown it to outperform VSM. Ultimately, we adopted a VSM-based approach that outperformed basic VSM and integrated natural-language concepts.
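As an illustration of the VSM basis underlying such similarity attributes, a minimal hand-rolled tf-idf/cosine sketch (not the authors' enhanced VSM variant; tokenization and weighting details are assumptions):

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: list of token lists. Returns one sparse tf-idf dict per document."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))                       # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in documents]

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A commit message sharing distinctive terms with an issue description would thus score higher than one sharing none, which is the intuition behind the textual similarity attributes.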
Researchers have also integrated structural analysis of the code to support feature location (Rasool and Mäder, 2011; McMillan et al., 2012; Panichella et al., 2013). We did not include this in our current classifier; however, we will consider it in future work. Structural analysis may be especially helpful for finding additional classes that are related to an issue or bug. In less closely related work, researchers investigated the use of dynamic analysis for feature location (Kuang et al., 2012, 2015, 2017). Furthermore, Eisenbarth et al., (Eisenbarth et al., 2003) presented a technique combining dynamic and static analyses to rapidly focus on the system’s parts that relate to a specific set of features.
Our work focuses not only on feature requests (i.e., improvements), but also on tracing bugs to code. Canfora et al. used information retrieval techniques to identify files that were created or changed in response to a Bugzilla ticket (Canfora and Cerulo, 2005). They identified files changed in response to similar bug reports in the past, using standard information retrieval techniques. Kim et al. predicted which source code files would change as a result of bug-fix requests (Kim et al., 2013), using the Mozilla Firefox and Core code repositories as their corpus in tandem with the public Bugzilla database. They first trained a classifier to recognize 'usable' versus 'non-usable' bug reports, and then, using the bugs classified as usable, trained a second classifier to identify impacted classes. Our approach differs from their work in that our goal is to generate links directly from commits to issues so that we can make direct recommendations to users if they forget to tag a commit. Our goal is therefore to create trace links as the commits are made so that developers can accept or reject them in order to create a set of trusted links. In (Schermann et al., 2015), the authors proposed two heuristics, Loners and Phantoms, to infer trace links between commits and issues. We incorporate their concepts as one attribute in our classifier.
8. Threats to Validity
There are several potential threats to the validity of our study.
Internal Validity We split the available data set for each project into 80%/20% training and testing portions of the issues, retaining the temporal ordering of the project. Choosing another split point may produce different evaluation results. We explicitly considered only "resolved" bugs and improvements, assuming that all required source code modifications had already taken place. It may be possible that the process of resolving an issue does not manifest in commits. We tried to mitigate this by focusing on issues marked as "Fixed" or "Resolved"; however, some commits might intentionally not address an issue due to their triviality. This was evidenced in our final experiment, where our classifier recommended links even though no links existed. Furthermore, our study focused on improvements and bugs, as these were the predominant types of instances in our projects; however, we observed comparable commit link patterns for other issue types, suggesting that our approach would generalize.
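The temporally ordered 80/20 split could be sketched as follows (the `created` field name is an assumption):

```python
def temporal_split(issues, train_share=0.8):
    """Split issues 80/20 by creation time so that all training issues
    precede all test issues, preserving the project's temporal ordering."""
    ordered = sorted(issues, key=lambda issue: issue["created"])
    cut = int(len(ordered) * train_share)
    return ordered[:cut], ordered[cut:]
```

Unlike a shuffled split, this guarantees the classifier is never trained on issues that postdate the ones it is tested on.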
External Validity Our study focused solely on open-source projects. A potential threat to external validity arises when we want to generalize our findings to a wider set of projects, including commercial development. We have observed evidence of similar tagging practices in our own industrial collaborations, and therefore expect similar results. However, internal company regulations might influence commit practices, and thus the overall applicability of our approach is an open question. Another threat that might limit the generalizability of our results is the use of only one combination of issue tracking system (Jira) and version control system (Git). Other tools and platforms might encourage and/or enable different linking behavior.
9. Conclusions
In this paper, we studied the interlinking of commits and issues in open source development projects. An analysis of six large projects showed that on average only 60% of the commits are linked to issues. This incomplete linkage fundamentally limits the establishment of project-wide traceability. To overcome this problem, we proposed an approach that trains a classifier to recommend links at the time commits are made and also augments an existing set of commits and issues with automatically identified links. We identified structural, temporal, stakeholder-related, and textual-similarity factors as relevant information for automating this task and derived 18 attributes to quantify the relation between commit-issue pairs. A Random Forest classifier performed best on the trained attributes. We evaluated this trained model by conducting four different experiments. Two experiments studied classification performance for recommending links upon a new commit as well as for automatically augmenting missing links. We found that the classifier yielded an average recall of 96% within a short list of three recommendations and could on average automatically augment every second link correctly at a precision above 89%. Finally, we manually constructed a small answer set of links from the set of previously unlinked commits and showed that the classifier returned high recall results averaging 91.6% and a precision of 17.3%.
Acknowledgments
The work was partially funded by the German Ministry of Education and Research (BMBF) grants 01IS14026A, 01IS16003B, by DFG grant MA 5030/3-1, and by the EU EFRE/Thüringer Aufbaubank (TAB) grant 2015FE9033. It was also funded by the US National Science Foundation grant CCF:1319680.
- Antoniol et al. (2002) Giuliano Antoniol, Gerardo Canfora, Gerardo Casazza, Andrea De Lucia, and Ettore Merlo. 2002. Recovering Traceability Links between Code and Documentation. IEEE Transactions on Software Engineering 28, 10 (2002), 970–983. https://doi.org/10.1109/TSE.2002.1041053
- Anvik et al. (2006) John Anvik, Lyndon Hiew, and Gail C. Murphy. 2006. Who should fix this bug?. In 28th International Conference on Software Engineering (ICSE 2006), Shanghai, China, May 20-28, 2006, Leon J. Osterweil, H. Dieter Rombach, and Mary Lou Soffa (Eds.). ACM, 361–370. https://doi.org/10.1145/1134285.1134336
- Asuncion et al. (2010) Hazeline U. Asuncion, Arthur U. Asuncion, and Richard N. Taylor. 2010. Software traceability with topic modeling. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE 2010, Cape Town, South Africa, 1-8 May 2010, Jeff Kramer, Judith Bishop, Premkumar T. Devanbu, and Sebastián Uchitel (Eds.). ACM, 95–104. https://doi.org/10.1145/1806799.1806817
- Bachmann and Bernstein (2009) Adrian Bachmann and Abraham Bernstein. 2009. Software Process Data Quality and Characteristics: A Historical View on Open and Closed Source Projects. In Proceedings of the Joint International and Annual ERCIM Workshops on Principles of Software Evolution (IWPSE) and Software Evolution (Evol) Workshops (IWPSE-Evol ’09). ACM, 119–128. https://doi.org/10.1145/1595808.1595830
- Breiman (2001) Leo Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32.
- Canfora and Cerulo (2005) Gerardo Canfora and Luigi Cerulo. 2005. Impact analysis by mining software and change request repositories. In 11th IEEE International Symposium on Software Metrics. IEEE, 9 pp.
- Cavnar (1995) W Cavnar. 1995. Using an n-gram-based document representation with a vector processing retrieval model. NIST SPECIAL PUBLICATION SP (1995), 269–269.
- Cleland-Huang et al. (2014) Jane Cleland-Huang, Orlena Gotel, Jane Huffman Hayes, Patrick Mäder, and Andrea Zisman. 2014. Software traceability: trends and future directions. In Proceedings of the on Future of Software Engineering, FOSE 2014, Hyderabad, India, May 31 - June 7, 2014, James D. Herbsleb and Matthew B. Dwyer (Eds.). ACM, 55–69. https://doi.org/10.1145/2593882.2593891
- Cleland-Huang et al. (2012) Jane Cleland-Huang, Mats Heimdahl, Jane Huffman Hayes, Robyn Lutz, and Patrick Mäder. 2012. Trace Queries for Safety Requirements in High Assurance Systems. In 18th International Working Conference on Requirements Engineering: Foundation for Software Quality (REFSQ) (LNCS), Vol. 7195. 179–193.
- Cleland-Huang et al. (2012) Jane Cleland-Huang, Patrick Mäder, Mehdi Mirakhorli, and Sorawit Amornborvornwong. 2012. Breaking the big-bang practice of traceability: Pushing timely trace recommendations to project stakeholders. In 2012 20th IEEE International Requirements Engineering Conference (RE), Chicago, IL, USA, September 24-28, 2012, Mats Per Erik Heimdahl and Pete Sawyer (Eds.). IEEE Computer Society, 231–240. https://doi.org/10.1109/RE.2012.6345809
- D’Ambros et al. (2012) Marco D’Ambros, Michele Lanza, and Romain Robbes. 2012. Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empirical Software Engineering 17, 4-5 (2012), 531–577.
- De Lucia et al. (2004) Andrea De Lucia, Fausto Fasano, Rocco Oliveto, and Genoveffa Tortora. 2004. Enhancing an Artefact Management System with Traceability Recovery Features. In 20th IEEE International Conference on Software Maintenance (ICSM). 306–315.
- Dekhtyar et al. (2007) Alex Dekhtyar, Jane Huffman Hayes, Senthil Karthikeyan Sundaram, Elizabeth Ashlee Holbrook, and Olga Dekhtyar. 2007. Technique Integration for Requirements Assessment. In 15th IEEE International Requirements Engineering Conference, RE 2007, October 15-19th, 2007, New Delhi, India. IEEE Computer Society, 141–150. https://doi.org/10.1109/RE.2007.17
- Eaddy et al. (2008) Marc Eaddy, Alfred V. Aho, Giuliano Antoniol, and Yann-Gaël Guéhéneuc. 2008. CERBERUS: Tracing Requirements to Source Code Using Information Retrieval, Dynamic Analysis, and Program Analysis. In The 16th IEEE International Conference on Program Comprehension, ICPC 2008, Amsterdam, The Netherlands, June 10-13, 2008, René L. Krikhaar, Ralf Lämmel, and Chris Verhoef (Eds.). IEEE Computer Society, 53–62. https://doi.org/10.1109/ICPC.2008.39
- Eisenbarth et al. (2003) Thomas Eisenbarth, Rainer Koschke, and Daniel Simon. 2003. Locating features in source code. IEEE Transactions on software engineering 29, 3 (2003), 210–224.
- Falessi et al. (to appear) Davide Falessi, Massimiliano Di Penta, Gerardo Canfora, and Giovanni Cantone. to appear. Estimating the number of remaining links in traceability recovery. Empirical Software Engineering (to appear).
- Fleiss (1971) Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76, 5 (1971), 378 – 382.
- Foundation ([n. d.]) Apache Software Foundation. [n. d.]. How Should I Apply Patches From A Contributor. ([n. d.]). http://www.apache.org/dev/committers.html#applying-patches
- Gethers et al. (2011) M. Gethers, R. Oliveto, D. Poshyvanyk, and Andrea De Lucia. 2011. On Integrating Orthogonal Information Retrieval Methods to Improve Traceability Link Recovery. In 27th IEEE International Conference on Software Maintenance (ICSM). 133–142.
- Git SCM ([n. d.]) Git SCM [n. d.]. Git SCM. ([n. d.]). http://www.git-scm.com.
- Gotel et al. (2012) Orlena Gotel, Jane Cleland-Huang, Jane Huffman Hayes, Andrea Zisman, Alexander Egyed, Paul Grünbacher, Alex Dekhtyar, Giuliano Antoniol, Jonathan Maletic, and Patrick Mäder. 2012. Traceability Fundamentals. In Software and Systems Traceability, Jane Cleland-Huang, Orlena Gotel, and Andrea Zisman (Eds.). Springer, 3–22. https://doi.org/10.1007/978-1-4471-2239-5_1
- Gotel and Finkelstein (1994) O. C. Z. Gotel and Anthony Finkelstein. 1994. An analysis of the requirements traceability problem. In Proceedings of the First IEEE International Conference on Requirements Engineering, ICRE ’94, Colorado Springs, Colorado, USA, April 18-21, 1994. IEEE, 94–101. https://doi.org/10.1109/ICRE.1994.292398
- Guo et al. (2017) Jin Guo, Jinghui Cheng, and Jane Cleland-Huang. 2017. Semantically enhanced software traceability using deep learning techniques. In Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, Sebastián Uchitel, Alessandro Orso, and Martin P. Robillard (Eds.). IEEE / ACM, 3–14. https://doi.org/10.1109/ICSE.2017.9
- Guo et al. (2013) Jin Guo, Jane Cleland-Huang, and Brian Berenbach. 2013. Foundations for an expert system in domain-specific traceability. In 21st IEEE International Requirements Engineering Conference (RE). 42–51. https://doi.org/10.1109/RE.2013.6636704
- Guo et al. (2016) Jin Guo, Mona Rahimi, Jane Cleland-Huang, Alexander Rasin, Jane Huffman Hayes, and Michael Vierhauser. 2016. Cold-start software analytics. In Proceedings of the 13th International Conference on Mining Software Repositories, MSR 2016, Austin, TX, USA, May 14-22, 2016. 142–153. https://doi.org/10.1145/2901739.2901740
- Guo et al. (2004) Lan Guo, Yan Ma, Bojan Cukic, and Harshinder Singh. 2004. Robust Prediction of Fault-Proneness by Random Forests. In Proceedings of the 15th International Symposium on Software Reliability Engineering (ISSRE ’04). IEEE Computer Society, Washington, DC, USA, 417–428. https://doi.org/10.1109/ISSRE.2004.35
- Hall et al. (2009) Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl. 11, 1 (Nov. 2009), 10–18. https://doi.org/10.1145/1656274.1656278
- Hall (1998) M. A. Hall. 1998. Correlation-based Feature Subset Selection for Machine Learning. Ph.D. Dissertation. University of Waikato, Hamilton, New Zealand.
- Huffman Hayes et al. (2006) Jane Huffman Hayes, Alex Dekhtyar, and Senthil Karthikeyan Sundaram. 2006. Advancing Candidate Link Generation for Requirements Tracing: The Study of Methods. IEEE Transactions on Software Engineering 32, 1 (2006), 4–19.
- JIRA ([n. d.]) JIRA [n. d.]. Jira Issue Tracking Software. ([n. d.]). http://www.jira.com.
- Kim et al. (2013) Dongsun Kim, Yida Tao, Sunghun Kim, and Andreas Zeller. 2013. Where should we fix this bug? a two-phase recommendation model. Software Engineering, IEEE Transactions on 39, 11 (2013), 1597–1610.
- Kim et al. (2011) Sunghun Kim, Hongyu Zhang, Rongxin Wu, and Liang Gong. 2011. Dealing with Noise in Defect Prediction. In Proceedings of the 33rd International Conference on Software Engineering (ICSE ’11). ACM, New York, NY, USA, 481–490. https://doi.org/10.1145/1985793.1985859
- Kuang et al. (2015) Hongyu Kuang, Patrick Mäder, Hao Hu, Achraf Ghabi, LiGuo Huang, Jian Lü, and Alexander Egyed. 2015. Can method data dependencies support the assessment of traceability between requirements and source code? Journal of Software: Evolution and Process 27, 11 (2015), 838–866. https://doi.org/10.1002/smr.1736
- Kuang et al. (2012) Hongyu Kuang, Patrick Mäder, Hao Hu, Achraf Ghabi, LiGuo Huang, Jian Lv, and Alexander Egyed. 2012. Do data dependencies in source code complement call dependencies for understanding requirements traceability?. In 28th IEEE International Conference on Software Maintenance (ICSM). 181–190.
- Kuang et al. (2017) Hongyu Kuang, Jia Nie, Hao Hu, Patrick Rempel, Jian Lu, Alexander Egyed, and Patrick Mäder. 2017. Analyzing closeness of code dependencies for improving IR-based Traceability Recovery. In IEEE 24th International Conference on Software Analysis, Evolution and Reengineering, SANER 2017, Klagenfurt, Austria, February 20-24, 2017, Martin Pinzger, Gabriele Bavota, and Andrian Marcus (Eds.). IEEE Computer Society, 68–78. https://doi.org/10.1109/SANER.2017.7884610
- Lohar et al. (2013a) Sugandha Lohar, Sorawit Amornborvornwong, Andrea Zisman, and Jane Cleland-Huang. 2013a. Improving trace accuracy through data-driven configuration and composition of tracing features. In 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE). 378–388.
- Lohar et al. (2013b) Sugandha Lohar, Sorawit Amornborvornwong, Andrea Zisman, and Jane Cleland-Huang. 2013b. Improving trace accuracy through data-driven configuration and composition of tracing features. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, 378–388.
- Lucia et al. (2012) Andrea De Lucia, Andrian Marcus, Rocco Oliveto, and Denys Poshyvanyk. 2012. Information Retrieval Methods for Automated Traceability Recovery. In Software and Systems Traceability. 71–98. https://doi.org/10.1007/978-1-4471-2239-5_4
- Mäder and Cleland-Huang (2013) Patrick Mäder and Jane Cleland-Huang. 2013. A visual language for modeling and executing traceability queries. Software and System Modeling 12, 3 (2013), 537–553.
- Mäder and Cleland-Huang (2015) Patrick Mäder and Jane Cleland-Huang. 2015. From Raw Project Data to Business Intelligence. IEEE Software 32, 4 (2015), 22–25. https://doi.org/10.1109/MS.2015.92
- Mäder and Egyed (2015) Patrick Mäder and Alexander Egyed. 2015. Do developers benefit from requirements traceability when evolving and maintaining a software system? Empirical Software Engineering 20, 2 (2015), 413–441. https://doi.org/10.1007/s10664-014-9314-z
- Mäder et al. (2013) Patrick Mäder, Paul L. Jones, Yi Zhang, and Jane Cleland-Huang. 2013. Strategic Traceability for Safety-Critical Projects. IEEE Software 30, 3 (2013), 58–66.
- Mäder et al. (2017) Patrick Mäder, Rocco Oliveto, and Andrian Marcus. 2017. Empirical studies in software and systems traceability. Empirical Software Engineering 22, 3 (2017), 963–966. https://doi.org/10.1007/s10664-017-9509-1
- Mann and Whitney (1947) Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics (1947), 50–60.
- McMillan et al. (2012) Collin McMillan, Negar Hariri, Denys Poshyvanyk, Jane Cleland-Huang, and Bamshad Mobasher. 2012. Recommending source code for use in rapid software prototypes. In 34th International Conference on Software Engineering, ICSE 2012, June 2-9, 2012, Zurich, Switzerland, Martin Glinz, Gail C. Murphy, and Mauro Pezzè (Eds.). IEEE Computer Society, 848–858. https://doi.org/10.1109/ICSE.2012.6227134
- Panichella et al. (2013) Annibale Panichella, Collin McMillan, Evan Moritz, Davide Palmieri, Rocco Oliveto, Denys Poshyvanyk, and Andrea De Lucia. 2013. When and How Using Structural Information to Improve IR-Based Traceability Recovery. In 17th European Conference on Software Maintenance and Reengineering, CSMR 2013, Genova, Italy, March 5-8, 2013, Anthony Cleve, Filippo Ricca, and Maura Cerioli (Eds.). IEEE Computer Society, 199–208. https://doi.org/10.1109/CSMR.2013.29
- Park et al. (2012) Jihun Park, Miryung Kim, Baishakhi Ray, and Doo-Hwan Bae. 2012. An empirical study of supplementary bug fixes. In 9th IEEE Working Conference of Mining Software Repositories, MSR 2012, June 2-3, 2012, Zurich, Switzerland, Michele Lanza, Massimiliano Di Penta, and Tao Xie (Eds.). IEEE Computer Society, 40–49. https://doi.org/10.1109/MSR.2012.6224298
- Ramesh and Jarke (2001) Balasubramaniam Ramesh and Matthias Jarke. 2001. Toward Reference Models of Requirements Traceability. IEEE Transactions on Software Engineering 27, 1 (2001), 58–93.
- Rasool and Mäder (2011) Ghulam Rasool and Patrick Mäder. 2011. Flexible design pattern detection based on feature types. In 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), Lawrence, KS, USA, November 6-10, 2011, Perry Alexander, Corina S. Pasareanu, and John G. Hosking (Eds.). IEEE Computer Society, 243–252. https://doi.org/10.1109/ASE.2011.6100060
- Rath et al. (2018) Michael Rath, Jacob Rendall, Jin Guo, Jane Cleland-Huang, and Patrick Mäder. 2018. Replication Data for: Traceability in the Wild: Automatically Augmenting Incomplete Trace Links. https://goo.gl/2NMjcF. (2018). https://doi.org/10.7910/DVN/BZLSLA
- Regan et al. (2014) Gilbert Regan, Miklós Biró, Fergal McCaffery, Kevin McDaid, and Derek Flood. 2014. A Traceability Process Assessment Model for the Medical Device Domain. In Systems, Software and Services Process Improvement - 21st European Conference, EuroSPI 2014, Luxembourg, June 25-27, 2014. Proceedings (Communications in Computer and Information Science), Béatrix Barafort, Rory V. O’Connor, Alexander Poth, and Richard Messnarz (Eds.), Vol. 425. Springer, 206–216. https://doi.org/10.1007/978-3-662-43896-1_18
- Rempel and Mäder (2016) Patrick Rempel and Patrick Mäder. 2016. A quality model for the systematic assessment of requirements traceability. In Software Engineering 2016, Fachtagung des GI-Fachbereichs Softwaretechnik, 23.-26. Februar 2016, Wien, Österreich (LNI), Jens Knoop and Uwe Zdun (Eds.), Vol. 252. GI, 37–38. http://subs.emis.de/LNI/Proceedings/Proceedings252/article52.html
- Rempel and Mäder (2017) Patrick Rempel and Patrick Mäder. 2017. Preventing Defects: The Impact of Requirements Traceability Completeness on Software Quality. IEEE Transactions on Software Engineering 43, 8 (2017), 777–797. https://doi.org/10.1109/TSE.2016.2622264
- Rempel et al. (2013) Patrick Rempel, Patrick Mäder, and Tobias Kuschke. 2013. Towards feature-aware retrieval of refinement traces. In 7th International Workshop on Traceability in Emerging Forms of Software Engineering, TEFSE 2013, 19 May, 2013, San Francisco, CA, USA, Nan Niu and Patrick Mäder (Eds.). IEEE Computer Society, 100–104. https://doi.org/10.1109/TEFSE.2013.6620163
- Rempel et al. (2014) Patrick Rempel, Patrick Mäder, Tobias Kuschke, and Jane Cleland-Huang. 2014. Mind the gap: assessing the conformance of software traceability to relevant guidelines. In 36th International Conference on Software Engineering, ICSE ’14, Hyderabad, India - May 31 - June 07, 2014, Pankaj Jalote, Lionel C. Briand, and André van der Hoek (Eds.). ACM, 943–954. https://doi.org/10.1145/2568225.2568290
- Romo et al. (2014) Bilyaminu Auwal Romo, Andrea Capiluppi, and Tracy Hall. 2014. Filling the Gaps of Development Logs and Bug Issue Data. In Proceedings of The International Symposium on Open Collaboration, OpenSym 2014, Berlin, Germany, August 27 - 29, 2014, Dirk Riehle, Jesús M. González-Barahona, Gregorio Robles, Kathrin M. Möslein, Ina Schieferdecker, Ulrike Cress, Astrid Wichmann, Brent J. Hecht, and Nicolas Jullien (Eds.). ACM, 8:1–8:4. https://doi.org/10.1145/2641580.2641592
- Schermann et al. (2015) Gerald Schermann, Martin Brandtner, Sebastiano Panichella, Philipp Leitner, and Harald C. Gall. 2015. Discovering loners and phantoms in commit and issue data. In Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension, ICPC 2015, Florence/Firenze, Italy, May 16-24, 2015, Andrea De Lucia, Christian Bird, and Rocco Oliveto (Eds.). IEEE Computer Society, 4–14. https://doi.org/10.1109/ICPC.2015.10
- Seiler and Paech (2017) Marcus Seiler and Barbara Paech. 2017. Using Tags to Support Feature Management Across Issue Tracking Systems and Version Control Systems - A Research Preview. In Requirements Engineering: Foundation for Software Quality - 23rd International Working Conference, REFSQ 2017, Essen, Germany, February 27 - March 2, 2017, Proceedings (Lecture Notes in Computer Science), Paul Grünbacher and Anna Perini (Eds.), Vol. 10153. Springer, 174–180. https://doi.org/10.1007/978-3-319-54045-0_13
- Shin et al. (2015) Yonghee Shin, Jane Huffman Hayes, and Jane Cleland-Huang. 2015. Guidelines for Benchmarking Automated Software Traceability Techniques. In 8th IEEE/ACM International Symposium on Software and Systems Traceability, SST 2015, Florence, Italy, May 17, 2015. 61–67. https://doi.org/10.1109/SST.2015.13
- Spanoudakis et al. (2004) George Spanoudakis, Andrea Zisman, Elena Pérez-Miñana, and Paul Krause. 2004. Rule-based generation of requirements traceability relations. Journal of Systems and Software 72, 2 (2004), 105–127.
- Ståhl et al. (2017) Daniel Ståhl, Kristofer Hallén, and Jan Bosch. 2017. Achieving traceability in large scale continuous integration and delivery deployment, usage and validation of the eiffel framework. Empirical Software Engineering 22, 3 (2017), 967–995. https://doi.org/10.1007/s10664-016-9457-1
- Sultanov et al. (2011) Hakim Sultanov, Jane Huffman Hayes, and Wei-Keat Kong. 2011. Application of swarm techniques to requirements tracing. Requirements Engineering 16, 3 (2011), 209–226.
- Witten et al. (2016) Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2016. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.