Ray Summit

Scaling and Unifying SciKit Learn and Spark Pipelines using Ray

Wednesday, June 23, 7:25PM UTC

Raghu Ganti, Principal Research Staff Member, IBM

Pipelines have become ubiquitous as stringing multiple functions together to compose applications has grown in popularity. Common pipeline abstractions such as "fit" and "transform" are even shared across divergent platforms such as Python's Scikit-Learn and Apache Spark.

Scaling pipelines at the level of simple functions is desirable for many AI applications, but it is not directly supported by Ray's parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray's compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.

Attendees will learn how pipelined workflows can be mapped to Ray's compute model and how they can both unify and accelerate their pipelines with Ray.
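As a rough illustration of the "fit" and "transform" abstraction mentioned above, here is a minimal sketch in plain Python. It is not the abstraction presented in the talk: the stage classes (`Center`, `Scale`) and the helpers (`fit_transform_chain`, `run_branches`) are hypothetical names invented for this example, and a standard-library thread pool stands in for Ray so the sketch runs without extra dependencies.

```python
from concurrent.futures import ThreadPoolExecutor


class Center:
    """Toy stage: learns the mean in fit, subtracts it in transform."""

    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        return self

    def transform(self, xs):
        return [x - self.mean for x in xs]


class Scale:
    """Toy stage: learns the largest absolute value in fit, divides by it in transform."""

    def fit(self, xs):
        # Fall back to 1.0 so an all-zero input does not divide by zero.
        self.factor = max(abs(x) for x in xs) or 1.0
        return self

    def transform(self, xs):
        return [x / self.factor for x in xs]


def fit_transform_chain(stages, xs):
    """Run one linear pipeline: fit each stage, then pass its output to the next."""
    for stage in stages:
        xs = stage.fit(xs).transform(xs)
    return xs


def run_branches(branches, xs):
    """Run independent pipeline branches concurrently on the same input.

    Assumption: a thread pool is used here purely as a stand-in; in a Ray
    setting each branch could instead be submitted as a remote task.
    """
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fit_transform_chain, b, xs) for b in branches]
        return [f.result() for f in futures]


data = [1.0, 2.0, 3.0, 6.0]
results = run_branches([[Center(), Scale()], [Scale()]], data)
```

Because every stage exposes the same two methods, the runner does not need to know which platform a stage comes from, which is the property that makes a shared pipeline abstraction possible.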

Speakers

Raghu Ganti

Principal Research Staff Member, IBM

Raghu Ganti has been a Research Staff Member at IBM T. J. Watson Research Center, Yorktown Heights, since September 2010. He is part of the Distributed Cognitive IoT department. His research interests span wireless sensor networks, privacy, and data mining. For the past several years, he has been working on spatiotemporal analytics (the analysis of moving objects) and developing algorithms that bring spatiotemporal capabilities to IBM's big data products. Parts of his work are now embedded in products such as IBM InfoSphere Streams, SPSS statistical modeler, and InfoSphere SenseMaking.