A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software
Advancing our understanding of software vulnerabilities, and automating their identification, the analysis of their impact, and ultimately their mitigation, is necessary to enable the development of more secure software.
While operating a vulnerability assessment tool that we developed and that is currently used by hundreds of development units at SAP, we manually collected and curated a dataset of vulnerabilities of open-source software and the commits fixing them. The data was obtained both from the National Vulnerability Database (NVD) and from project-specific Web resources that we monitor on a continuous basis.
From that data, we extracted a dataset that maps 624 publicly disclosed vulnerabilities affecting 205 distinct open-source Java projects, used in SAP products or internal tools, onto the 1282 commits that fix them. Out of 624 vulnerabilities, 29 do not have a CVE identifier at all and 46, which do have a CVE identifier assigned by a numbering authority, are not available in the NVD yet.
The dataset is released under an open-source license, together with supporting scripts that allow researchers to automatically retrieve the actual content of the commits from the corresponding repositories and to augment the attributes available for each instance. Also, these scripts make it possible to complement the dataset with additional instances that are not security fixes (which is useful, for example, in machine learning applications).
Our dataset has been successfully used to train classifiers that could automatically identify security-relevant commits in code repositories. The release of this dataset and the supporting code as open-source will allow future research to be based on data of industrial relevance; also, it represents a concrete step towards making the maintenance of this dataset a shared effort involving open-source communities, academia, and the industry.
While the availability of mature, high-quality open-source software (OSS) components represents an opportunity for the software industry to accelerate innovation and lower costs, maintaining a secure OSS supply chain and an effective vulnerability management process remains a difficult challenge.
In recent years, several approaches to the management of OSS vulnerabilities have emerged, and even the market of commercial tools has matured significantly. However, as shown in prior work, an effective vulnerability management approach cannot rely purely on metadata (which are often inaccurate, incomplete, or missing). Rather, it has to be code-centric, that is, it must be based on the actual analysis of vulnerabilities and their fixes at code level.
While operating a vulnerability assessment tool that we developed to implement such a code-centric approach, and that is currently used by hundreds of development units at SAP, we manually collected a substantial amount of code-level data about vulnerabilities of open-source software and their fixes.
Starting from that data, and with the objective of studying automated approaches to simplify the expensive and difficult problem of manually identifying the fixes for new vulnerability disclosures, we created a dataset, which we present here, that was successfully used to train automated commit classifiers.
Our dataset maps 624 publicly disclosed vulnerabilities affecting 205 distinct open-source Java projects used in SAP software (either products or internal tools) onto the 1282 commits that fix them. It was constructed and manually curated over a period of four years, monitoring the disclosure of security advisories, not only from the NVD, but also from numerous project-specific Web pages.
Compared to existing works, our dataset has several distinguishing characteristics.
Differently from prior work that constructed datasets by automatically extracting data from the whole NVD, we manually curated the data of each vulnerability we analyzed. While the overall number of vulnerabilities we could cover is lower, by manually curating our dataset we could ensure the quality of each entry, detecting and addressing some of the inconsistencies that are known to affect the NVD, and including many commits that could not be found on the NVD. We estimate that over 70% of the vulnerabilities in our dataset do not have any link to a source code repository in their NVD page.
Because the selection of the projects reflects the population of OSS that are actually used in real enterprise software products or internal tools, our dataset only includes projects that have practical industrial relevance. We are not aware of any other freely available dataset that has this characteristic.
Differently from datasets that collect commits introducing vulnerabilities, ours contains the commits that fix them.
Snyk (https://snyk.io/vuln/) does advertise its vulnerability database through its website. That database focuses on vulnerability descriptions, and a reference to the fix is not always included. Also, access to the Snyk dataset is subject to restrictions, and the data are not available for free download (Snyk stopped offering dumps of its vulnerability database on GitHub as of February 2018). Our dataset, instead, is available under the Apache license, and its release is part of an initiative to promote a collaborative effort, involving industry, academia, and the open-source community, to maintain and extend the dataset in the long run.
II. Dataset Construction
The motivation for collecting code-level vulnerability data originates in the work that SAP Security Research has performed since 2014 on the analysis of vulnerabilities in open-source software. A key result of that work is an approach to the detection, assessment, and mitigation of open-source vulnerabilities. The approach is implemented as a tool (internally called Vulas) that has been productively used at SAP to analyze hundreds of Java and Python software applications and that is now publicly available as free open-source software released under the Apache version 2.0 license (https://github.com/sap/vulnerability-assessment-tool).
The availability of a vulnerability knowledge base that is comprehensive, accurate, and updated in a timely manner is a prerequisite for such a tool to operate effectively. For this reason, we continuously monitor both the NVD and over 50 distinct project-specific websites for new vulnerability disclosures. For each such disclosure, we manually review the available information and we search for the corresponding fix-commit(s) in the code repository of the affected open-source component. The result of this ongoing activity, started in 2014, is a database of vulnerability data that has a very high coverage of the projects that are relevant to our company. From this database, we extracted 1282 commits, from 205 distinct open-source Java projects used in SAP software (either products or internal tools). These commits correspond to the fixes of 624 publicly known vulnerabilities. (The dataset we describe in this paper is a snapshot of our vulnerability database taken on 21 January 2019; it is available as a comma-separated file at https://github.com/copernico/msr2019. Some of the entries were discarded because the commits were invalid, e.g., some repositories had been dismissed or moved since we first introduced them in the dataset.)
III. Dataset Description
The dataset consists of a set of 4-tuples

(vulnerability_id, repository_url, commit_id, class_label)

where vulnerability_id is the identifier of the vulnerability being fixed in the commit with identifier commit_id, performed in the source code repository at repository_url. Each entry of the dataset represents a commit that contributes to fixing a vulnerability (a so-called fix commit), and thus class_label is always positive (pos). In certain applications, commits of the negative class (which would have a neg class label) are also needed (see Section IV). The dataset is released, together with supporting scripts to manipulate and extend it, under the Apache 2.0 license.
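For illustration, the 4-tuples can be loaded with the Python standard library alone. This is a minimal sketch; the column names used here are hypothetical and may differ from the header of the released CSV file:

```python
import csv
import io

# A sample row mirroring the 4-tuple structure; the header names and the
# commit identifier are illustrative placeholders, not actual dataset content.
sample = io.StringIO(
    "vulnerability_id,repository_url,commit_id,class_label\n"
    "CVE-2015-5348,https://github.com/apache/camel,abc123,pos\n"
)

rows = list(csv.DictReader(sample))
assert all(r["class_label"] == "pos" for r in rows)  # only fix commits
```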
The dataset covers 205 distinct projects, and includes 1282 unique commits corresponding to the fixes of 624 vulnerabilities. Differently from other existing datasets, our dataset includes not only vulnerabilities available in the National Vulnerability Database, but also those obtained from project-specific advisories. In particular, the dataset includes 29 vulnerabilities without a CVE name (CVE names are the most widely used naming convention for enumerating vulnerabilities), and 46 vulnerabilities that have been assigned a CVE identifier by a CVE numbering authority but are not yet published on the NVD.
The dataset can be considered to cover a representative sample of projects of practical industrial relevance: these projects were identified based on an analysis of the data collected at SAP while operating our open-source vulnerability assessment tool (internally known as Vulas) for a period of about four years, during which the tool was used for hundreds of thousands of security scans. Most of the open-source projects included in the dataset are hosted on GitHub.com (or a mirror of their official repository is available there).
The dataset is released as a comma-separated values (CSV) file, which makes it readily usable as input to scripts that can automatically fetch additional data from each repository. We provide example code (in the form of a Jupyter notebook and a few supporting Python scripts) that simplifies the manipulation, extension, and analysis of the dataset.
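As a sketch of what such a script might look like (the helper names are ours, and a local clone of the repository is assumed), additional commit data can be fetched by shelling out to Git:

```python
import subprocess

def git_cmd(repo_dir, *args):
    """Build a git invocation scoped to a given local clone."""
    return ["git", "-C", repo_dir, *args]

def commit_message(repo_dir, sha):
    """Fetch the full log message of a commit from a local clone."""
    result = subprocess.run(
        git_cmd(repo_dir, "log", "-1", "--format=%B", sha),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

The same pattern extends to any commit attribute Git exposes (diff, author, changed files) by swapping the format specifier or subcommand.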
As shown in Figure 1, most of the vulnerabilities (364) are fixed in a single commit, whereas in 7 cases the number of commits needed to fix the vulnerability exceeds ten, up to twenty-three for CVE-2015-5348. The top-20 vulnerabilities by number of fix-commits are shown in Table IV-a.
Figure 2 shows the distribution of the vulnerabilities in the dataset over the years. The bias towards recent years is mostly due to the fact that the number of vulnerabilities published keeps increasing. Moreover, it is often the case that very limited information is available for old vulnerabilities (e.g., dead reference links, or repositories that have moved and are no longer available). Finally, since the dataset is manually curated, the effort spent to identify fix-commits needs to be allocated carefully, and recent vulnerabilities are typically given higher priority.
Figure 3 shows the distribution of vulnerabilities across different repositories. In particular, it highlights that the majority of repositories (178 out of 205) have only one vulnerability. The top-20 projects by number of vulnerabilities are listed in Table IV-b.
Using the scripts we distribute with the dataset, researchers can easily clone all the repositories referred to in the dataset and follow our code samples to extract any commit feature available through Git.
As an example, for each fix-commit, we automatically determine the oldest tag from which that commit is reachable and compute the time elapsed between the two: this can be used to study the time that elapses between the moment a vulnerability fix is committed to a repository and the moment a new, non-vulnerable release (containing that fix) is created (and tagged).
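A possible sketch of this computation, assuming a local clone (`git describe --contains` reports the nearest tag from which the commit is reachable; the function names are ours):

```python
import subprocess
from datetime import datetime, timezone

def _commit_time(repo_dir, rev):
    """UTC timestamp of a revision, taken from its committer date."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", "-1", "--format=%ct", rev],
        capture_output=True, text=True, check=True,
    )
    return datetime.fromtimestamp(int(out.stdout.strip()), tz=timezone.utc)

def fix_to_release_days(repo_dir, sha):
    """Days between a fix commit and the oldest tag that contains it.

    Returns None when no tag is reachable, i.e., the fix is in the
    repository but not yet part of any (tagged) release.
    """
    described = subprocess.run(
        ["git", "-C", repo_dir, "describe", "--contains", sha],
        capture_output=True, text=True,
    )
    if described.returncode != 0:
        return None
    # `describe --contains` may append a suffix such as `~5`; keep the tag only.
    tag = described.stdout.strip().split("~")[0].split("^")[0]
    return (_commit_time(repo_dir, tag) - _commit_time(repo_dir, sha)).days
```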
From the security standpoint, because attackers could easily monitor issues and commits in the repositories of security-relevant open source projects, it is critical to keep this delay as short as possible. During that time-frame, the fix (and thus the vulnerability that it fixes) is public, but the clients relying on the affected open-source project cannot update to a non-vulnerable release, because it does not exist yet.
Figure 4 shows the distribution of such delays for the commits in our dataset. The figure shows that most fix commits (817) are released in less than 100 days (of which 181 are released the same day), but delays can be much higher in a considerable number of cases. Detailed information about how many days elapse between fix-commits and the releases containing them is available in Table IV-c.
It is interesting to observe that as many as 167 commits are not reachable from any tag (see last line in Table IV-c). Since most open-source projects follow the established practice of creating a tag for each release, this figure might suggest that a non-negligible number of fix-commits is available in the code repository of the project but is not yet part of any release.
Another example of a possible application of the dataset is the work of Sabetta and Bezzi. Motivated by the need to automate the maintenance of the very vulnerability database from which our dataset is extracted, they presented a novel approach to the automated classification of commits that are security-relevant (i.e., that are likely to fix a vulnerability). They used an older, smaller version of the dataset presented here to train two independent classifiers, considering, respectively, the patch introduced by a commit (Patch Classifier) and the log message (Message Classifier), without relying on information from vulnerability advisories.
Inspired by the naturalness hypothesis [1, 3], the Patch Classifier treats source code changes as documents written in natural language (code-as-text), and it classifies them using established Natural Language Processing (NLP) methods. A similar approach is also used for the Message Classifier. The results of the two classifiers, which are tuned for high precision, are then combined with a simple voting mechanism that flags a commit as security-relevant as soon as at least one of the two models does.
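The voting step can be sketched as follows. This is a toy illustration under our own assumptions, with trivial stand-ins replacing the trained models:

```python
def flag_security_relevant(patch_says_yes, message_says_yes):
    """Voting mechanism: flag the commit as security-relevant as soon as
    at least one of the two high-precision classifiers does (logical OR)."""
    return patch_says_yes or message_says_yes

# Hypothetical stand-ins for the trained Patch and Message classifiers.
patch_classifier = lambda diff: "sanitize" in diff.lower()
message_classifier = lambda msg: "cve" in msg.lower()

commit = {"diff": "+ sanitize(user_input)", "message": "improve input handling"}
flagged = flag_security_relevant(
    patch_classifier(commit["diff"]),
    message_classifier(commit["message"]),
)
```

Tuning each model for high precision and combining them with OR trades a little precision for recall only when one model is confident, which matches the stated design.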
It is common practice to commit the same change to multiple branches of the same repository, so that the fix is made available to users of different supported versions of the component at hand. As a consequence of this practice, our dataset contains duplicates (commits with different identifiers but identical content), which need to be removed before training a classifier. After the de-duplication, we are left with 862 unique commits (positive instances).
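The de-duplication step can be sketched by keying commits on a hash of their diff content (a minimal illustration with made-up commit records):

```python
import hashlib

def dedupe_by_content(commits):
    """Keep one commit per unique diff content: commits cherry-picked
    across branches share the diff but have different identifiers."""
    seen, unique = set(), []
    for c in commits:
        digest = hashlib.sha256(c["diff"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(c)
    return unique

commits = [
    {"sha": "a1", "diff": "+ validate(input)"},
    {"sha": "b2", "diff": "+ validate(input)"},  # same fix on another branch
    {"sha": "c3", "diff": "+ escape(output)"},
]
unique = dedupe_by_content(commits)  # 2 commits remain
```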
In machine learning applications such as this one, the dataset must include commits corresponding both to security fixes (positive class) and to other changes not related to security (negative class). To support these applications, we provide a script that, starting from the set of positive instances at our disposal, augments the dataset with negative instances (non-security commits). The algorithm works as follows: for each positive instance from a given repository, we take a fixed number of random commits from the same repository and, under the assumption that security-relevant commits are rare compared to other types of commits, we treat these as negative examples. To avoid including obvious outliers (extremely large, empty, or otherwise invalid commits), we perform a manual review of these commits, supported by ad-hoc scripts and pattern matching (similarly to prior work): the scripts speed up the manual review by searching the commit messages for patterns that indicate, with high probability, that a commit is security-related.
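A minimal sketch of this sampling and screening step, under our own assumptions (the keyword patterns and the parameter `k` are illustrative, not those used for the actual dataset):

```python
import random
import re

# Illustrative patterns; a message matching them suggests the commit may be
# security-related, so it is excluded from the candidate negatives.
SECURITY_PATTERNS = re.compile(
    r"\b(vulnerab\w*|security|cve-\d{4}-\d+|xss|overflow)\b", re.IGNORECASE
)

def sample_negatives(repo_commits, positive_shas, k, seed=0):
    """For each positive instance, pick k random commits from the same
    repository, excluding known fixes and likely security-related commits."""
    pool = [
        c for c in repo_commits
        if c["sha"] not in positive_shas
        and not SECURITY_PATTERNS.search(c["message"])
    ]
    rng = random.Random(seed)
    return rng.sample(pool, min(k * len(positive_shas), len(pool)))

# Made-up commit records for illustration.
repo = [
    {"sha": "p1", "message": "fix CVE-2015-5348 request handling"},
    {"sha": "n1", "message": "update changelog for 2.16.1"},
    {"sha": "n2", "message": "refactor build scripts"},
    {"sha": "s1", "message": "harden against XSS in error page"},  # screened out
]
negatives = sample_negatives(repo, positive_shas={"p1"}, k=2)
```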
The model of Sabetta and Bezzi seems to represent an improvement over a similar state-of-the-art approach that relies on log messages processed through a more complex architecture trained with a different (and substantially larger) dataset. Unfortunately, that dataset is not public, which makes it impossible to conduct a reliable comparison of the two approaches.
We hope that by sharing our dataset we will encourage further works to challenge the existing state of the art and to demonstrate the extent of their improvements on the basis of a common benchmark that is freely available, machine-readable, and covering open-source projects that are of actual industrial relevance.
V. Concluding Remarks
We have presented a dataset of fixes to vulnerabilities in Java OSS projects of industrial relevance, resulting from our experience with developing and operating an open-source vulnerability management solution at SAP.
While the data we are releasing at this time cover only Java projects, the tool has in the meantime evolved to support more languages, and consequently we are working to extend our dataset further.
To remain useful for the validation of enterprise applications in the long term, the dataset needs to be updated on a continuous basis to incorporate new vulnerabilities as soon as they are disclosed. While the current dataset has been constructed at SAP, we are working to make its future maintenance a community effort and to encourage the very developers who implement the security fixes for vulnerable open-source components to contribute to this dataset. To facilitate this community effort, we plan to release support tools and processes, thus complementing the release of our vulnerability assessment tool, which was open-sourced in October 2018. We are convinced that the tool and its vulnerability database represent a practical contribution to ensuring the security of the open-source software supply-chain.
-  Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. arXiv preprint arXiv:1709.06182, 2017.
-  Antonios Gkortzis, Dimitris Mitropoulos, and Diomidis Spinellis. Vulinoss: A dataset of security vulnerabilities in open-source systems. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR ’18, pages 18–21, New York, NY, USA, 2018. ACM.
-  Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the naturalness of software. In Proceedings of the 34th International Conference on Software Engineering, ICSE ’12, pages 837–847, Piscataway, NJ, USA, 2012. IEEE Press.
-  Ivan Pashchenko, Henrik Plate, Serena Elisa Ponta, Antonino Sabetta, and Fabio Massacci. Vulnerable open source dependencies: counting those that matter. In Markku Oivo, Daniel Méndez Fernández, and Audris Mockus, editors, Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2018, Oulu, Finland, October 11-12, 2018, pages 42:1–42:10. ACM, 2018.
-  Henning Perl, Sergej Dechand, Matthew Smith, Daniel Arp, Fabian Yamaguchi, Konrad Rieck, Sascha Fahl, and Yasemin Acar. Vccfinder: Finding potential vulnerabilities in open-source projects to assist code audits. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pages 426–437, New York, NY, USA, 2015. ACM.
-  Henrik Plate, Serena Elisa Ponta, and Antonino Sabetta. Impact assessment for vulnerabilities in open-source software libraries. In Software Maintenance and Evolution (ICSME), 2015 IEEE International Conference on, pages 411–420. IEEE, 2015.
-  Serena Elisa Ponta, Henrik Plate, and Antonino Sabetta. Beyond metadata: Code-centric and usage-based analysis of known vulnerabilities in open-source software. In IEEE International Conference on Software Maintenance and Evolution (ICSME), Sept 2018.
-  Antonino Sabetta and Michele Bezzi. A practical approach to the automatic classification of security-relevant commits. In 34th IEEE International Conference on Software Maintenance and Evolution (ICSME), Sept 2018.
-  Yaqin Zhou and Asankhaya Sharma. Automated identification of security issues from commit messages and bug reports. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, pages 914–919, New York, NY, USA, 2017. ACM.