Ray Summit

RayDP: Build Large-scale End-to-end Data Analytics and AI Pipelines Using Spark and Ray

A large-scale end-to-end data analytics and AI pipeline usually involves a data processing framework such as Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A basic setup uses two separate clusters and glues multiple jobs together. Other solutions include running deep learning frameworks in a Spark cluster, or using workflow orchestrators like Kubeflow to stitch distributed programs together. All of these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP, which allows you to start a Spark job on Ray from your Python program and use Ray's in-memory object store to exchange data efficiently between Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.

Speakers

Carson Wang

Software Engineering Manager, Intel Data Analytics Software Group, Intel
