MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales
We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem of enhancements that expand the Apache Spark distributed computing library to tackle problems in Deep Learning, Micro-Service Orchestration, Gradient Boosting, Model Interpretability, and other areas of modern computation. Furthermore, we present a novel system called Spark Serving that allows users to run any Apache Spark program as a distributed, sub-millisecond latency web service backed by their existing Spark Cluster. All MMLSpark contributions have the same API to enable simple composition across frameworks and usage across batch, streaming, and RESTful web serving scenarios on static, elastic, or serverless clusters. We showcase MMLSpark by creating a method for deep object detection capable of learning without human labeled data and demonstrate its effectiveness for Snow Leopard conservation.
Mark Hamilton ††thanks: Microsoft Applied AI, Cambridge, MA firstname.lastname@example.org Sudarshan Raghunathan Ilya Matiach ††thanks: Microsoft Azure Machine Learning, Cambridge, MA Andrew Schonhoffer Anand Raman ††thanks: Microsoft AI, Redmond, WA Eli Barzilay Minsoo Thigpen ††thanks: Microsoft AI Development Acceleration Program, Cambridge, MA ††thanks: Contributed Equally Karthik Rajendran Janhavi Suresh Mahajan Courtney Cochrane Abhiram Eswaran Ari Green
Preprint. Work in progress.
As the field of machine learning has advanced, frameworks for using, authoring, and training machine learning systems have proliferated. These different frameworks often have dramatically different APIs, data models, usage patterns, and scalability considerations. This heterogeneity makes it difficult to combine systems and complicates production deployments. In this work we present Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem that aims to unify major machine learning workloads into a single API that enables execution in a variety of distributed production grade environments. We describe the techniques and principles used to unify a representative sample of machine learning technologies, each with its own software stack, communication requirements, and paradigms. We also introduce tools for deploying these technologies as distributed real-time web services.
Throughout this work we build upon the distributed computing framework Apache Spark [spark]. Spark is capable of a broad range of workloads and applications such as fault-tolerant and distributed map, reduce, filter, and aggregation style programs. Spark clusters can adaptively resize to compute a workload efficiently (elasticity) and can run on resource managers such as Yarn, Mesos, Kubernetes, or manually created clusters. Furthermore, Spark has language bindings in several popular languages like Scala, Java, Python, R, Julia, C# and F#, making it usable from almost any project.
Since its inception, Spark has expanded its scope to support SQL, streaming, machine learning, and graph style computations [sparksql, sparkml, graphx]. This broad set of APIs allows a rich space of computations that we can leverage for our work. More specifically, we build upon the SparkML API, which is similar to the popular scikit-learn machine learning library for Python [sklearn]. Like scikit-learn, all SparkML models have the same API, which makes it easy to create, substitute, and compose machine learning algorithms into “pipelines”. However, SparkML has several key advantages such as horizontal scalability across a cluster, streaming compatibility, support for structured datasets, broad language support, a fluent API for initializing complex algorithms, and a type system that differentiates computations based on whether they extract state (learn) from data. These properties make the SparkML API a natural and principled choice to unify the APIs of other machine learning frameworks.
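The distinction between stages that learn state and stages that do not can be illustrated with a minimal, framework-free sketch. The class names Estimator, Transformer, Pipeline, and PipelineModel mirror SparkML's vocabulary, but everything below is a toy reimplementation on lists of dictionaries, written only for illustration; Scale, MeanCenter, and Shift are hypothetical stages and are not part of SparkML or MMLSpark.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, float]


class Transformer:
    """Stateless stage: maps a dataset to a dataset."""

    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    """Stage that must learn from data first; fit() returns a Transformer."""

    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


@dataclass
class Scale(Transformer):
    """Hypothetical stage: multiply one column by a fixed factor."""
    column: str
    factor: float

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.column: r[self.column] * self.factor} for r in rows]


@dataclass
class Shift(Transformer):
    """Hypothetical stage: add a fixed offset to one column."""
    column: str
    offset: float

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.column: r[self.column] + self.offset} for r in rows]


@dataclass
class MeanCenter(Estimator):
    """Hypothetical stage: learns the column mean, then subtracts it."""
    column: str

    def fit(self, rows: List[Row]) -> Transformer:
        mean = sum(r[self.column] for r in rows) / len(rows)
        return Shift(self.column, -mean)


class PipelineModel(Transformer):
    """A fitted pipeline: only Transformers remain."""

    def __init__(self, stages: List[Transformer]):
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Fits each Estimator on the output of the stages before it."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows: List[Row]) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if isinstance(stage, Estimator) else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)
```

Because every stage exposes the same two methods, stages from different frameworks can be swapped or chained without changing the surrounding code.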
Across the broader computing literature, many have turned to intermediate languages to “unify” and integrate disparate forms of computation. One of the most popular of these languages is the Hypertext Transfer Protocol (HTTP) used widely throughout internet communications. To enable broad adoption and integration of code, one simply needs to create a web-hosted HTTP endpoint or “service”. Furthermore, this intermediate language allows system components to scale independently to minimize bottlenecks. If services reside on the same machine, one can use local networking capabilities to bypass internet data transfer costs and come closer to the latency of normal function dispatch.
Companies like Microsoft, Amazon, and Google have standardized around HTTP to provide pre-built intelligent algorithms for a wide range of applications [aws-service]. This standardization enables easy use of cloud intelligence and abstracts away the implementation details and required compute. In Microsoft’s ecosystem, the Azure Cognitive Services provide intelligent solutions for text, vision, speech, search, time series, and geospatial workloads.
In this work we describe our contributions in three key areas: 1) Unifying other ecosystems with Spark. 2) Integrating Spark with the networking language HTTP and the Azure Cognitive Services. 3) Deploying any Spark computation as a distributed web service. Together, these contributions allow users to create scalable machine learning systems that draw from a wide variety of libraries and expose these contributions as web services for others to use.
3.1 Algorithms and Frameworks
3.1.1 Deep Learning
To enable GPU accelerated deep learning on Spark, we have parallelized Microsoft’s deep learning framework, the Cognitive Toolkit (CNTK) [CNTK]. This framework powers roughly 80% of Microsoft’s internal deep learning workloads and is flexible enough to create most models described in the deep learning literature. CNTK is similar to other automatic differentiation systems like TensorFlow, PyTorch, and MXNet, as they all create symbolic computation graphs that automatically differentiate and compile to machine code. These tools liberate developers and researchers from the difficult task of deriving training algorithms and writing GPU accelerated code.
CNTK’s core functionality is written in C++ but exposed to C# and Python through bindings. To integrate this framework into Spark, we used the Simple Wrapper and Interface Generator (SWIG) to contribute a set of Java bindings to CNTK [swig]. These bindings enable users to call and train CNTK models from Java, Scala, and other JVM based languages, and allowed us to create a SparkML transformer to distribute CNTK in Scala. We then automatically generate PySpark and SparklyR bindings for our Spark transformers. Furthermore, we have optimized our implementation to broadcast the model to each worker using BitTorrent broadcasting, re-use C++ objects to reduce garbage collection overhead, asynchronously mini-batch data, and share weights between local threads to reduce memory overhead. With CNTK on Spark, users can embed any deep network into parallel maps, SQL queries, and streaming pipelines. Furthermore, we have built a large cloud repository of trained models and tools to perform image classification with transfer learning. We have utilized this work for wildlife recognition, bio-medical entity extraction, and gas station fire detection [mmlspark].
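The mini-batching idea can be sketched without any Spark or CNTK dependency. In the sketch below, a partition is modelled as a plain Python iterator and the model as any callable that scores a list of rows at once; the function names batches and evaluate_partition are hypothetical, and the sketch is synchronous, so it does not reproduce the asynchronous batching or the weight sharing of the actual implementation.

```python
import itertools
from typing import Callable, Iterable, Iterator, List


def batches(rows: Iterable, size: int) -> Iterator[List]:
    """Group a stream of rows into lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


def evaluate_partition(
    rows: Iterable,
    model: Callable[[List], List],
    batch_size: int,
) -> Iterator:
    """Score one partition lazily, calling the model once per mini-batch.

    `model` stands in for a network that was loaded once per worker and
    is far cheaper to call on a batch than on individual rows.
    """
    for batch in batches(rows, batch_size):
        yield from model(batch)
```

A function of this shape is the kind of body one would hand to a per-partition map, so that the cost of each native call is amortized over many rows.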
3.1.2 Gradient Boosting and Decision Trees
We have also integrated the GPU enabled gradient boosting library, LightGBM, into Spark [lightgbm]. This library is one of the most popular and performant decision tree frameworks. Under the hood, LightGBM on Spark uses Message Passing Interface (MPI) communication that is significantly less chatty than SparkML’s Gradient Boosted Tree and thus, trains up to 30% faster. Like CNTK, LightGBM is written in C++ and there are bindings for use in other languages. Again, we used SWIG to contribute a set of Java bindings to LightGBM for use in Spark. However, unlike CNTK evaluation, LightGBM training involves nontrivial MPI communication between workers. To unify Spark’s API with LightGBM’s MPI communication, we transfer control to LightGBM with a Spark “MapPartitions” operation. More specifically, we communicate the hostnames of all workers to the driver node of the Spark cluster and use this information to launch an MPI ring. The LightGBM processes will then fetch needed data from their sister Spark worker processes. This integration allows users to create rich and powerful trees and forests for classification, quantile regression, and other applications.
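The driver-side step of this handoff can be sketched as follows, assuming each partition reports its worker address as a "host:port" string. The function build_ring and the deterministic rank assignment by sorting are illustrative assumptions; the sketch does not show LightGBM's actual network initialization or any of its API.

```python
from typing import Dict, List, Tuple


def build_ring(worker_addresses: List[str]) -> Dict[str, Tuple[int, str, str]]:
    """Turn the addresses gathered on the driver into a ring topology.

    Returns a mapping: address -> (rank, previous address, next address).
    """
    # Several partitions may live on the same worker, so de-duplicate,
    # and sort so that every run assigns ranks the same way.
    nodes = sorted(set(worker_addresses))
    n = len(nodes)
    return {
        address: (rank, nodes[(rank - 1) % n], nodes[(rank + 1) % n])
        for rank, address in enumerate(nodes)
    }
```

Once every worker knows its rank and neighbours, control can pass to the native processes, which communicate among themselves rather than through Spark.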
3.1.3 Model Interpretability
In addition to integrating frameworks into Spark through transfers of control, we have also expanded SparkML’s native library of algorithms. One example is our distributed implementation of Local Interpretable Model Agnostic Explanations (LIME) [lime]. This method provides a way to “interpret” the predictions of a model without reference to that model’s functional form. More concretely, LIME interprets black box functions through a locally linear approximation constructed from a sampling procedure.
For image classifiers, the intuition is that if “turning off” a patch of image dramatically changes a classifier’s output, that patch is “important”. More formally, LIME creates thousands of perturbed images by setting random chunks or “superpixels” of the image to a neutral color. Next, it feeds each of these perturbed images through the model to see how the perturbations affect the model’s output. Finally, it uses a locally weighted Lasso model to learn a linear mapping between a Boolean vector representing the “states” of the superpixels to the model’s outputs.
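The three steps above can be sketched on a single machine with a toy "image" that has one value per superpixel. This is a simplified illustration rather than the distributed implementation: the proximity kernel (based on the fraction of superpixels turned off), the kernel width, the regularization strength, and the coordinate descent solver are all assumptions chosen to keep the example short, and the names lime_explain and weighted_lasso are hypothetical.

```python
import math
import random
from typing import Callable, List


def soft_threshold(value: float, lam: float) -> float:
    """Shrink value toward zero by lam (the Lasso proximal step)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def weighted_lasso(X, y, w, lam: float, sweeps: int = 100) -> List[float]:
    """Coordinate descent for
    0.5 * sum_i w_i * (y_i - b - x_i . beta)^2 + lam * sum_j |beta_j|."""
    beta = [0.0] * len(X[0])
    residual = list(y)  # y_i - b - x_i . beta, starting from b = 0, beta = 0
    total_weight = sum(w)
    for _ in range(sweeps):
        # The intercept b is unpenalised: absorb the weighted mean residual.
        shift = sum(wi * ri for wi, ri in zip(w, residual)) / total_weight
        residual = [ri - shift for ri in residual]
        for j in range(len(beta)):
            # Correlation of feature j with the residual that excludes j.
            rho = sum(
                wi * xi[j] * (ri + beta[j] * xi[j])
                for xi, ri, wi in zip(X, residual, w)
            )
            z = sum(wi * xi[j] ** 2 for xi, wi in zip(X, w))
            updated = soft_threshold(rho, lam) / z if z > 0 else 0.0
            step = updated - beta[j]
            residual = [ri - step * xi[j] for xi, ri in zip(X, residual)]
            beta[j] = updated
    return beta


def lime_explain(
    model: Callable[[List[float]], float],
    image: List[float],
    n_samples: int = 200,
    lam: float = 0.01,
    kernel_width: float = 0.75,
    neutral: float = 0.0,
    seed: int = 0,
) -> List[float]:
    """Return one importance weight per superpixel of `image`."""
    rng = random.Random(seed)
    n = len(image)
    states, outputs, weights = [], [], []
    for _ in range(n_samples):
        # 1) Sample a Boolean state vector and "turn off" the zeroed superpixels.
        state = [rng.randint(0, 1) for _ in range(n)]
        perturbed = [px if on else neutral for px, on in zip(image, state)]
        # 2) Query the black box model on the perturbed image.
        states.append(state)
        outputs.append(model(perturbed))
        # Assumed kernel: samples closer to the original image count more.
        distance = 1.0 - sum(state) / n
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
    # 3) Fit a locally weighted sparse linear model from states to outputs.
    return weighted_lasso(states, outputs, weights, lam)
```

For a toy classifier whose output depends mostly on one superpixel, the largest coefficient returned by lime_explain falls on that superpixel, which is the sense in which the patch is "important".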
To interpret a classifier’s predictions for an image, one must evaluate the classifier on thousands of perturbed images to sufficiently sample the superpixel state space. Practically speaking, if it takes 1 hour to score a model on your dataset, it would take days to interpret this dataset with LIME. We have created a distributed implementation to reduce this massive computation time. LIME affords several possible distributed implementations, and we have chosen a parallelization scheme that speeds each individual interpretation. More specifically, we parallelize the superpixel decompositions over the input images. Next, we iterate through the superpixel decompositions and create a new parallel collection of “state samples” for each input image. We then perturb these images and apply the model in parallel. Finally, we fit a distributed linear model to the inner collection and add its weights to the original parallel collection. Because of this nontrivial parallelization scheme, this kind of integration benefited from a complete re-write in fast compiled Scala and Spark SQL, as opposed to using a tool like Py4J to integrate the existing LIME code into Spark.
3.2 Networking and Cloud Services: HTTP and the Cognitive Services on Spark
In Section 3.1 we explored three contributions that unify Spark with other Machine Learning tools using the Java Native Interface (JNI) and function dispatch. These methods are efficient but require re-implementing code in Scala or auto-generating wrappers from existing code. For many frameworks, these tight integrations are made impossible by differences in language, operating system, or computational architecture. For these cases, we can utilize inter-process communication protocols like HTTP to bridge the gap between systems.
We present HTTP on Spark, an integration between the entire HTTP communication protocol and Spark SQL. HTTP on Spark allows Spark users to leverage the parallel networking capabilities of their cluster to integrate any local, docker, or web service. At a high level, HTTP on Spark provides a simple and principled way to integrate any framework into the Spark ecosystem. The contribution adds HTTP Request and Response types to the Spark SQL schema so that users can create and manipulate their requests and responses using SQL operations, maps, reduces, and filters. When combined with SparkML, users can chain services together, allowing Spark to function as a distributed micro-service orchestrator. HTTP on Spark also provides asynchronous parallelism, automatic batching, throttling, and exponential backoffs for failed requests. Furthermore, one can use HTTP on Spark with Kubernetes or other container orchestrators to deploy services directly onto Spark worker machines [Kubernetes]. This enables near native integration speeds as requests do not travel across machines.
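The exponential backoff behaviour can be sketched as a small retry loop. The sketch represents requests and responses as plain dictionaries and takes the transport as an injected send callable, so it makes no network calls; the function name send_with_backoff, the choice to retry on status 429 and 5xx, and the default delays are assumptions for illustration, not the policy of the actual implementation.

```python
import time
from typing import Callable, Dict

Request = Dict[str, object]
Response = Dict[str, object]


def send_with_backoff(
    request: Request,
    send: Callable[[Request], Response],
    max_retries: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Send a request, doubling the wait after every failed attempt."""
    delay = base_delay
    response = send(request)
    for _ in range(max_retries):
        status = response["status"]
        # Assumed policy: only throttling (429) and server errors are retried.
        if status != 429 and status < 500:
            return response
        sleep(delay)
        delay *= 2
        response = send(request)
    # Retries exhausted: hand back the last response, whatever it was.
    return response
```

Since requests and responses are ordinary structured values, a function like this can be applied row by row inside a map, which is what lets a cluster act as a parallel web client.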
We have also built on this work to create a simple and powerful integration between the Microsoft Cognitive Services and Spark. The Cognitive Services on Spark allows users to embed cloud intelligence directly into their Spark and SQL computations. This contribution aims to liberate users from low level networking details, so they can focus on creating intelligent distributed applications. Each Cognitive Service is a SparkML transformer, so users can add intelligent services to existing SparkML workflows. Additionally, every request parameter can be set with a scalar value or be vectorized with an entire dataframe column. This flexibility enables a huge family of succinct yet powerful queries. For example, by vectorizing the “subscription key” parameter, users can distribute requests across several accounts, regions, or deployments to maximize throughput and resiliency to error.
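The scalar-or-column parameter idea can be sketched in a few lines. ServiceParam and build_requests are hypothetical names invented for this illustration and do not reflect the MMLSpark API; rows are plain dictionaries rather than dataframe rows. The header name Ocp-Apim-Subscription-Key is the one the Cognitive Services use for key authentication, while the URL in any usage is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ServiceParam:
    """A request parameter set either to one scalar or to a column name."""
    value: Optional[str] = None
    column: Optional[str] = None

    def resolve(self, row: Dict[str, str]) -> Optional[str]:
        # A column reference is looked up per row; a scalar is shared by all.
        if self.column is not None:
            return row[self.column]
        return self.value


def build_requests(
    rows: List[Dict[str, str]],
    url: str,
    subscription_key: ServiceParam,
    text_column: str,
) -> List[Dict[str, object]]:
    """Build one request description per input row."""
    return [
        {
            "url": url,
            "headers": {"Ocp-Apim-Subscription-Key": subscription_key.resolve(row)},
            "body": row[text_column],
        }
        for row in rows
    ]
```

Passing ServiceParam(column="key") instead of ServiceParam(value="...") is all it takes to spread rows across several accounts, since each row then carries its own key.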
3.3 Scalable Real-Time Web Serving
Through HTTP on Spark, we have enabled Spark as a distributed web client. In this work we also contribute Spark Serving, a framework that allows Spark clusters to operate as a distributed web server. This transforms Spark into a distributed networking framework in addition to a machine learning framework. Spark Serving builds upon Spark’s Structured Streaming library, which transforms existing Spark SQL computations into continuously running streaming queries. Structured Streaming supports a large majority of Spark primitives including maps, filters, aggregations, and joins. To convert a batch query to a streaming query, users only change a single line of code in how they read their datasets and can keep all other computational logic in place. We extend Structured Streaming to web serving by framing serving as a special case of a streaming job. More specifically, a web service is a streaming pipeline where the data source and the data sink are the same HTTP request. Under the hood, each Spark worker/executor manages a web service that enqueues incoming data in an efficient parallel data structure that supports constant time routing, addition, deletion, and load balancing across multiple threads. Each worker converts these requests to a SQL type for an HTTP Request (the same types used in HTTP on Spark) with a unique routing ID and pushes them into the computational pipeline. To reply to incoming requests, users leverage a new data sink that uses routing IDs and SQL objects for HTTP Responses to reply to the stored request. Our choice of HTTP Request and Response data types enables users to work with the entire breadth of the HTTP protocol for full customization. As of MMLSpark v0.14, we have integrated Spark Serving with Spark’s new Continuous Processing feature. Continuous Processing dramatically reduces the latency of streaming pipelines. This acceleration enables real-time web services and machine learning applications.
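The idea that the source and the sink are the same HTTP exchange can be sketched with a queue and a dictionary keyed by routing ID. ServingWorker and its methods are hypothetical names for this single-process illustration; it omits the real web server, the threading, and the load balancing, and only shows how a routing ID ties a reply back to the stored request.

```python
import itertools
import queue
from typing import Callable, Dict, List, Tuple


class ServingWorker:
    """Toy model of one worker: a request source and a reply sink."""

    def __init__(self) -> None:
        self._ids = itertools.count()
        self._incoming: "queue.Queue[Tuple[int, str]]" = queue.Queue()
        # routing id -> callback that completes the open HTTP exchange;
        # a dict gives constant time insertion, lookup, and deletion.
        self._open: Dict[int, Callable[[str], None]] = {}

    def accept(self, body: str, reply: Callable[[str], None]) -> int:
        """Source side: remember the open exchange and enqueue the request."""
        routing_id = next(self._ids)
        self._open[routing_id] = reply
        self._incoming.put((routing_id, body))
        return routing_id

    def next_batch(self) -> List[Tuple[int, str]]:
        """Drain the queued requests as rows for the streaming pipeline."""
        batch = []
        while True:
            try:
                batch.append(self._incoming.get_nowait())
            except queue.Empty:
                return batch

    def respond(self, routing_id: int, body: str) -> None:
        """Sink side: use the routing id to answer the stored request."""
        reply = self._open.pop(routing_id)
        reply(body)
```

Any computation that maps the rows from next_batch to (routing ID, response body) pairs can then be served, because the sink needs nothing but the ID to find the waiting client.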
To the authors’ knowledge, Spark Serving is the only serving framework that leverages an existing Spark cluster. As a result, developers do not need to re-implement their models or export them into other formats and runtimes, such as MLeap, to create web services [mleap]. Furthermore, frameworks that require additional run-times add complexity and incur the cost of another deployment system in addition to the Spark cluster. Spark Serving can deploy any Spark computation as a web service including all of the above contributions (CNTK, LightGBM, SparkML, Cognitive Services, HTTP Services), their combinations, and their compositions. Through other open source contributions in the Spark ecosystem one can deploy TensorFlow, XGBoost, and scikit-learn models, as well as arbitrary Python, R, Scala, and Java code.
We have used MMLSpark to power engagements in a wide variety of machine learning domains, such as text, image, and speech domains. In this work, we will highlight the aforementioned contributions in our ongoing work to use MMLSpark for wildlife conservation.
4.1 Snow Leopard Conservation
Snow leopards are dwindling due to poaching, mining, and retribution killing yet we know little about how to best protect them. Currently, researchers estimate that there are only about four thousand to seven thousand individual animals within a potential 2 million square kilometer range of rugged central Asian mountains [slt]. Our collaborators, the Snow Leopard Trust, have deployed hundreds of motion sensitive cameras across large areas of Snow Leopard territory to help monitor this population [slt]. Over the years, these cameras have produced over 1 million images, but most of these images are of goats, sheep, foxes, or other moving objects and manual sorting takes thousands of hours. Using tools from the MMLSpark ecosystem, we have created an automated system to classify and localize Snow Leopards in camera trap images without any human labeled data. This method saves the Trust hours of labor and provides data to help identify individual leopards by their spot patterns.
4.2 Unsupervised Classification
In our previous work, we used deep transfer learning with CNTK on Spark to create a system that could classify Snow Leopards in camera trap images [mmlspark]. This work leveraged a large dataset of manually labeled images accumulated through years of intensive labelling by the Snow Leopard Trust. In this work, we show that we can avoid all dependence on human labels by using Bing Image Search to automatically curate a labeled Snow Leopard dataset. More specifically, we used our SparkML bindings for Bing Image Search to make this process easy and scalable. To create the Snow Leopard part of the dataset, we pulled the first 80 pages of the results for the “Snow Leopard” query. To create a dataset of negative images, we drew inspiration from Noise Contrastive Estimation, a mathematical technique used frequently in the Word Embedding literature [noise-contrastive]. More specifically, we generated a large and diverse dataset of random images by using random queries as a surrogate for random image sampling. We used an existing online random word generator to create a dataframe of thousands of random queries. We used Bing Image Search on Spark to pull the first 10 images for each random query. After generating the two datasets, we used Spark SQL to add labels, stitch them together, drop duplicates, and download the images to the cluster in only a few minutes. Next, we used CNTK on Spark to train a deep classification network with transfer learning on our automatically generated dataset, yielding an automated snow leopard classifier. Though we illustrated this process with Snow Leopard classification, the method applies to any domain that can be searched on Bing Images.
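The dataset assembly steps (label, stitch together, drop duplicates) can be sketched on plain lists. The search argument below is a hypothetical callable standing in for an image search binding, taking a query and a result count and returning URLs; the function names, the in-memory vocabulary, and the rule of keeping the first label seen for a duplicated URL are assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (image URL, label)


def random_queries(vocabulary: List[str], count: int, seed: int = 0) -> List[str]:
    """Draw random words to use as a surrogate for random image sampling."""
    rng = random.Random(seed)
    return [rng.choice(vocabulary) for _ in range(count)]


def collect(
    queries: List[str],
    search: Callable[[str, int], List[str]],
    per_query: int,
    label: int,
) -> List[Example]:
    """Run every query and attach the same label to all returned URLs."""
    return [(url, label) for query in queries for url in search(query, per_query)]


def build_dataset(positives: List[Example], negatives: List[Example]) -> List[Example]:
    """Stitch the two sets together and drop duplicate URLs.

    Assumption: when a URL appears more than once, the first occurrence
    (and therefore a positive label, if any) is kept.
    """
    seen = set()
    dataset = []
    for url, label in positives + negatives:
        if url not in seen:
            seen.add(url)
            dataset.append((url, label))
    return dataset
```

The same three functions work for any positive query, which reflects why the approach is not specific to snow leopards.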
4.3 Unsupervised Object Detection
Many animal identification systems, such as HotSpotter, require more than just classification probabilities to identify individual animals by their patterns [hotspotter]. In this work, we introduce a refinement method capable of extracting a deep object detector from any image classifier. When combined with our unsupervised dataset generation technique in Section 4.2, we can create a custom object detector for any object found on Bing Image Search. This method leverages our LIME on Spark contribution to “interpret” our trained leopard classifier. These classifier interpretations often directly highlight leopard pixels, allowing us to refine our input dataset with localization information. However, this refinement operation incurs the 1000x computation cost associated with LIME, making even the distributed version untenable for real-time applications. Instead, we can use this localized dataset to train a real-time object detection algorithm like Faster-RCNN [fasterrcnn]. In effect, we train Faster-RCNN to quickly reproduce the computationally expensive LIME outputs. This network serves as a fast leopard localization algorithm that does not require human labels at any step of the training process. Because LIME is a model agnostic interpretation engine, this refinement technique can apply to any image classifier from any domain. Figure 1 shows a diagram of the end-to-end architecture.
We discovered that our completely unsupervised object detector closely matched human drawn bounding boxes on most images. Tables 1 and 2 show that the performance of our method can approach that of classifiers and object detectors trained on human labelled images. However, certain types of images posed problems for our method. Our network tended to only highlight the visually dominant leopards in images with more than one leopard, such as those in Figure 2. We hypothesize that this arises from our simple method of converting LIME outputs to bounding boxes. Because we only draw a single box around highlighted pixels, our algorithm has only seen examples with a single bounding box. In the future, we plan to cluster LIME pixels to identify images with bi-modal interpretations. Furthermore, the method also missed several camouflaged leopards, as in Figure 3. We hypothesize that this is an anthropic effect, as Bing only returns clear images of leopards. We plan to explore this effect by combining this Bing generated data with active learning on a “real” dataset to help humans target the toughest examples quickly.
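The single-box conversion and its limitation can be made concrete with a small sketch. It assumes the LIME output is available as a per-pixel grid of superpixel IDs plus one weight per superpixel, and that a superpixel counts as highlighted when its weight exceeds a threshold; the function name bounding_box and the thresholding rule are illustrative assumptions rather than the exact procedure used.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def bounding_box(
    segments: List[List[int]],
    weights: Dict[int, float],
    threshold: float,
) -> Optional[Box]:
    """Draw one box around every pixel whose superpixel is highlighted."""
    xs, ys = [], []
    for y, row in enumerate(segments):
        for x, segment_id in enumerate(row):
            if weights[segment_id] > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # nothing was highlighted in this image
    return (min(xs), min(ys), max(xs), max(ys))
```

When two separate regions are highlighted, this procedure still returns one box spanning both, which is consistent with the single-box training examples described above and motivates clustering the highlighted pixels first.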
In this work we have introduced Microsoft Machine Learning for Apache Spark, a framework that aims to integrate a wide variety of computing technologies into a single distributed API. We have contributed CNTK, LightGBM, and LIME on Spark, and have added a foundational integration between Spark and the HTTP Protocol. We built on this to integrate the Microsoft Cognitive Services with Spark and create a novel real-time serving framework for Spark models. We have also shown that by combining these technologies, one can build and deploy a deep Snow Leopard object detector without any dependence on costly human labeled data. Throughout the work, we have made no assumptions regarding the application domain. Hence, this method can extract custom object detectors for anything searchable on Bing Images. Together, our contributions allow users to create a variety of production grade machine learning applications in only a few lines of Spark code. Our contributions dramatically expand the Spark framework into several new areas of modern computing and have already enabled a new generation of distributed machine learning applications.
We would like to acknowledge the generous support from our collaborators, Dr. Koustubh Sharma, Rhetick Sengupta, Michael Despines, and the rest of the Snow Leopard Trust. We would also like to acknowledge those at Microsoft who helped fund and share this work: the Microsoft AI for Earth Program, Lucas Joppa, Joseph Sirosh, Pablo Castro, Ben Brodsky, Brian Smith, Arvind Krishnaa Jagannathan, and Wee Hyong Tok.