OpenML-Python: an extensible Python API for OpenML
OpenML is an online platform for open science collaboration in machine learning, used to share datasets and results of machine learning experiments. In this paper we introduce OpenML-Python, a client API for Python, opening up the OpenML platform for a wide range of Python-based tools. It provides easy access to all datasets, tasks and experiments on OpenML from within Python. It also provides functionality to conduct machine learning experiments, upload the results to OpenML, and reproduce results which are stored on OpenML. Furthermore, it comes with a scikit-learn plugin and a plugin mechanism to easily integrate other machine learning libraries written in Python into the OpenML ecosystem. Source code and documentation are available at https://github.com/openml/openml-python/.
Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Müller, Joaquin Vanschoren and Frank Hutter
Keywords: Python, Collaborative Science, Meta-Learning, Reproducible Research
OpenML is a collaborative online machine learning (ML) platform, meant for sharing and building on prior empirical machine learning research (Vanschoren et al., 2014).
It goes beyond open data repositories, such as UCI (Dua and Graff, 2017), PMLB (Olson et al., 2018), the ‘datasets’ submodules in scikit-learn (Pedregosa et al., 2011) and tensorflow (Abadi et al., 2016), and the closed-source data sharing platform at Kaggle.com, since OpenML also collects millions of shared experiments on these datasets, linked to the exact ML pipelines and hyperparameter settings, and includes comprehensive logging and uploading functionalities which can be accessed programmatically via a REST API. However, sharing ML experiments adds significant complexity to most people’s workflows.
OpenML-Python is a seamless integration of OpenML into the popular Python ML ecosystem (see https://github.blog/2019-01-24-the-state-of-the-octoverse-machine-learning/) that takes away this complexity by providing easy programmatic access to all OpenML data and automating the sharing of new experiments. Other clients already exist for R (Casalicchio et al., 2017) and Java (van Rijn, 2016). In this paper, we introduce OpenML-Python’s core design, showcase its extensibility to new ML libraries, and give code examples for several common research tasks.
2 Use cases for the OpenML-Python API
OpenML-Python allows for easy dataset and experiment sharing by handling all communication with OpenML’s REST API. In this section, we briefly describe how the package can be used in several common machine learning tasks and highlight recent uses.
Working with datasets. OpenML-Python can retrieve the thousands of datasets on OpenML (all of them, or specific subsets) in a unified format, retrieve meta-data describing them, and search through them with filters. Datasets are converted from OpenML’s internal format into numpy, scipy or pandas data structures, which are standard for ML in Python. To facilitate contributions from the community, it allows people to upload new datasets in only two function calls, and to define new tasks on them (combinations of a dataset, train/test split and target attribute).
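The dataset retrieval described above can be sketched as follows. This is a minimal illustration rather than the package's reference usage: it assumes an installed `openml` package whose `openml.datasets.get_dataset` and `OpenMLDataset.get_data` behave as in recent releases (the exact return types depend on the installed version), and the helper name `load_dataset_as_arrays` is hypothetical.

```python
def load_dataset_as_arrays(dataset_id):
    """Download one OpenML dataset and split it into features and target."""
    # Imported lazily so this snippet loads even without the optional dependency.
    import openml

    dataset = openml.datasets.get_dataset(dataset_id)
    # get_data is assumed to return the features, the target, a per-column
    # categorical mask and the feature names, in that order.
    X, y, categorical_indicator, attribute_names = dataset.get_data(
        target=dataset.default_target_attribute
    )
    return X, y, categorical_indicator, attribute_names


# Example call (needs network access; ID 61 is assumed to be the 'iris' dataset):
#   X, y, is_categorical, names = load_dataset_as_arrays(61)
```

The helper only wraps two library calls; the unified format mentioned above is what makes the same function usable for any dataset ID.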
Publishing and retrieving results. Sharing empirical results allows anyone to search and download them in order to reproduce and reuse them in their own research. One goal of OpenML is to simplify the comparison of new algorithms and implementations to existing approaches by comparing to the results on OpenML. To this end we also provide an interface for integrating new machine learning libraries with OpenML and we have already integrated scikit-learn. OpenML-Python can then be used to set up and conduct machine learning experiments for a given task and flow (an ML pipeline including hyperparameters and random states), and publish reproducible results.
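A minimal sketch of running a model on a task and publishing the result might look like this. It assumes an installed `openml` package, a valid API key, and a `model` that the scikit-learn plugin can handle (for example a scikit-learn estimator); the helper name `run_and_publish` is hypothetical, and the calls are used as they exist in recent releases.

```python
def run_and_publish(model, task_id, api_key):
    """Evaluate `model` on an OpenML task and upload the resulting run."""
    # Imported lazily so this snippet loads even without the optional dependency.
    import openml

    # Uploading requires the API key of an OpenML account.
    openml.config.apikey = api_key

    task = openml.tasks.get_task(task_id)
    # Trains and predicts on the splits that the task defines.
    run = openml.runs.run_model_on_task(model, task)
    # Sends the predictions and the flow description to the server.
    run.publish()
    return run
```

Because the task fixes the dataset, target and splits, two runs on the same task are directly comparable, which is what makes the shared results reusable.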
Use cases in published works. OpenML-Python has already been used to scale up studies with hundreds of consistently formatted datasets (Feurer et al., 2015; Fusi et al., 2018), supply large amounts of meta-data for meta-learning (Perrone et al., 2018), answer questions about algorithms such as hyperparameter importance (van Rijn and Hutter, 2018) and facilitate large-scale comparisons of algorithms (Strang et al., 2018).
3 High-level Design of OpenML-Python
The OpenML platform is organized around several entity types which describe different aspects of a machine learning study. It hosts datasets, tasks that define how models should be evaluated on them, flows that record the structure and other details of ML pipelines, and runs that record the experiments evaluating specific flows on certain tasks. For instance, an experiment (run) shared on OpenML can show how a random forest (flow) performs on ‘iris’ (dataset) if evaluated with 10-fold cross-validation (task), and how to reproduce that result. In OpenML-Python, all these entities are represented by classes, each defined in their own submodule. This implements a natural mapping from OpenML concepts to Python objects. While OpenML is an online platform, we facilitate offline usage as well.
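To make the relationship between the four entity types concrete, the following toy model mirrors the random forest example above. The classes are hypothetical illustrations, not the classes shipped with OpenML-Python, and the target name and hyperparameter value are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str


@dataclass(frozen=True)
class Task:
    # A task ties a dataset to a target and an evaluation procedure.
    dataset: Dataset
    target: str
    evaluation_procedure: str


@dataclass(frozen=True)
class Flow:
    # A flow describes a pipeline and its hyperparameters.
    name: str
    hyperparameters: tuple = ()


@dataclass(frozen=True)
class Run:
    # A run records one flow evaluated on one task.
    task: Task
    flow: Flow


def describe(run):
    """Summarize a run in one sentence fragment."""
    return (
        f"{run.flow.name} evaluated on {run.task.dataset.name} "
        f"with {run.task.evaluation_procedure}"
    )


iris = Dataset("iris")
task = Task(iris, target="class", evaluation_procedure="10-fold cross-validation")
flow = Flow("random forest", hyperparameters=(("n_estimators", 100),))
run = Run(task, flow)
print(describe(run))
```

A run references a task and a flow rather than copying them, so many runs can share one task and be compared on equal terms.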
Plugins. To allow users to automatically run and share machine learning experiments with different libraries through the same OpenML-Python interface, we designed a plugin interface that standardizes the interaction between machine learning library code and OpenML-Python. We also created a plugin for scikit-learn (Pedregosa et al., 2011), as it is one of the most popular Python machine learning libraries. This plugin can be used for any library which follows the scikit-learn API (Buitinck et al., 2013).
A plugin’s responsibility is to convert between the libraries’ models and OpenML flows, interact with its training interface and format predictions.
For example, the scikit-learn plugin can convert an OpenMLFlow to an Estimator (including hyperparameter settings), train models and produce predictions for a task, and create an OpenMLRun object to upload the predictions to the OpenML server.
The plugin also handles advanced procedures, such as scikit-learn’s random search or grid search, and uploads their traces (hyperparameters and scores of each model evaluated during search).
We are working on more plugins, and anyone can contribute their own using the scikit-learn plugin implementation as a reference.
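The three plugin responsibilities named above (converting between models and flows, driving training, and formatting predictions) can be illustrated with a toy contract. This is a hypothetical sketch, not the plugin interface that OpenML-Python actually ships: `LibraryPlugin`, `MajorityClassModel` and `MajorityClassPlugin` are invented names, and a flow is reduced to a plain dictionary.

```python
from abc import ABC, abstractmethod
from collections import Counter


class LibraryPlugin(ABC):
    """Hypothetical plugin contract for one ML library."""

    @abstractmethod
    def model_to_flow(self, model):
        """Describe a model as a serializable flow."""

    @abstractmethod
    def flow_to_model(self, flow):
        """Rebuild a model, including hyperparameters, from a flow."""

    @abstractmethod
    def run_model(self, model, X_train, y_train, X_test):
        """Train the model and return predictions as a plain list."""


class MajorityClassModel:
    """Toy 'library' model that always predicts the most frequent label."""

    def __init__(self, fallback=None):
        self.fallback = fallback  # label used when there is no training data

    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0] if y else self.fallback
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


class MajorityClassPlugin(LibraryPlugin):
    def model_to_flow(self, model):
        return {"name": "MajorityClassModel", "parameters": {"fallback": model.fallback}}

    def flow_to_model(self, flow):
        return MajorityClassModel(**flow["parameters"])

    def run_model(self, model, X_train, y_train, X_test):
        return list(model.fit(X_train, y_train).predict(X_test))
```

The round trip through `model_to_flow` and `flow_to_model` is the part that enables reproduction: a stored flow is enough to rebuild an equivalent model before rerunning it.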
We show two example uses of OpenML-Python to demonstrate its API’s simplicity. First, we show how to retrieve results and evaluations from the OpenML server in Figure 1 (generating the plot on the right). Second, in Figure 2 we show how to conduct experiments on a benchmark suite (Bischl et al., 2019). Further examples, including how to create datasets and tasks and how OpenML-Python was used in previous publications, can be found in the online documentation. We provide documentation and code examples at http://openml.github.io/openml-python and host the project at http://github.com/openml/openml-python.
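The figures are not reproduced in this text. As a rough stand-in for the benchmark-suite use case, the sketch below iterates over the tasks of a suite and evaluates one model on each. It is not the code from Figure 2: it assumes an installed `openml` package, that `openml.study.get_suite` accepts the alias 'OpenML-CC18' for the suite of Bischl et al. (2019), and a `model` supported by the scikit-learn plugin. The helper name `run_on_suite` and the `max_tasks` parameter are hypothetical.

```python
def run_on_suite(model, suite_name="OpenML-CC18", max_tasks=None):
    """Evaluate `model` on the tasks of a benchmark suite and return the runs."""
    # Imported lazily so this snippet loads even without the optional dependency.
    import openml

    suite = openml.study.get_suite(suite_name)
    runs = []
    # suite.tasks is assumed to be a list of task IDs; max_tasks=None keeps all.
    for task_id in suite.tasks[:max_tasks]:
        task = openml.tasks.get_task(task_id)
        runs.append(openml.runs.run_model_on_task(model, task))
    return runs
```

The returned runs are local objects; each could then be uploaded with its `publish()` method, given a configured API key.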
5 Project development
The project has been set up for development through community effort from different research groups, and has received contributions from numerous individuals. The package is developed publicly on GitHub, which also provides an issue tracker for bug reports, feature requests and usage questions. To ensure a coherent and robust code base we use continuous integration for Windows and Linux as well as automated type and style checking. Documentation is also rendered on continuous integration servers and consists of a mix of tutorials, examples and API documentation.
For ease of use and stability, we use well-known and established third-party packages where needed. For instance, we build documentation using the popular sphinx Python documentation generator (http://www.sphinx-doc.org), use an extension (https://sphinx-gallery.github.io/) to automatically compile examples into documentation and Jupyter notebooks, and employ standard open-source packages for scientific computing such as numpy, scipy (Virtanen et al., 2019), and pandas (McKinney, 2010). The package is written in Python 3 and open-sourced with a 3-Clause BSD License (see the project repository at http://github.com/openml/openml-python).
OpenML-Python allows easy interaction with OpenML from within Python. It makes it easy for people to share and reuse the data, meta-data, and empirical results which are generated as part of an ML study. This allows for better reproducibility, simpler benchmarking and easier collaboration on ML projects. Our software is shipped with a scikit-learn plugin and has a plugin mechanism to easily integrate other ML libraries written in Python.
MF, NM and FH acknowledge funding by the Robert Bosch GmbH. AK, JvR and FH acknowledge funding by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant no. 716721. JV and PG acknowledge funding by the Data Driven Discovery of Models (D3M) program run by DARPA and the Air Force Research Laboratory. The authors also thank Bilge Celik, Victor Gal and everyone listed at https://github.com/openml/openml-python/graphs/contributors for their contributions.
- Abadi et al. (2016) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467 [cs.DC], 2016.
- Bischl et al. (2019) B. Bischl, G. Casalicchio, M. Feurer, F. Hutter, M. Lang, R. G. Mantovani, J. N. van Rijn, and J. Vanschoren. OpenML Benchmarking Suites. arXiv:1708.03731v2 [cs.LG], 2019.
- Buitinck et al. (2013) L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Müller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, et al. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD LML Workshop, 2013.
- Casalicchio et al. (2017) G. Casalicchio, J. Bossek, M. Lang, D. Kirchhoff, P. Kerschke, B. Hofner, H. Seibold, J. Vanschoren, and B. Bischl. OpenML: An R package to connect to the machine learning platform OpenML. Computational Statistics, 32(3), 2017.
- Dua and Graff (2017) D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Feurer et al. (2015) M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter. Efficient and Robust Automated Machine Learning. In Proc. of NeurIPS’15, 2015.
- Fusi et al. (2018) N. Fusi, R. Sheth, and M. Elibol. Probabilistic Matrix Factorization for Automated Machine Learning. In Proc. of NeurIPS’18, 2018.
- McKinney (2010) W. McKinney. Data Structures for Statistical Computing in Python. In Proc. of SciPy, 2010.
- Olson et al. (2018) R. S. Olson, W. La Cava, Z. Mustahsan, A. Varik, and J. H. Moore. Data-driven Advice for Applying Machine Learning to Bioinformatics Problems. In Proc. of PSB’18, pages 192–203, 2018.
- Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, et al. Scikit-learn: Machine Learning in Python. JMLR, 12, 2011.
- Perrone et al. (2018) V. Perrone, R. Jenatton, M. Seeger, and C. Archambeau. Scalable Hyperparameter Transfer Learning. In Proc. of NeurIPS’18, 2018.
- Strang et al. (2018) B. Strang, P. van der Putten, J. N. van Rijn, and F. Hutter. Don’t Rule Out Simple Models Prematurely: A Large Scale Benchmark Comparing Linear and Non-linear Classifiers in OpenML. In Proc. of IDA XVII, 2018.
- van Rijn (2016) J. N. van Rijn. Massively Collaborative Machine Learning. PhD thesis, Leiden University, 2016.
- van Rijn and Hutter (2018) J. N. van Rijn and F. Hutter. Hyperparameter Importance Across Datasets. In Proc. of KDD’18, 2018.
- Vanschoren et al. (2014) J. Vanschoren, J. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning. SIGKDD, 15(2):49–60, 2014.
- Virtanen et al. (2019) P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. arXiv:1907.10121 [cs.MS], 2019.