Webinar

Ray Datasets: Scalable data preprocessing for distributed ML

Wednesday, February 23, 5:00PM UTC

Ray Datasets is a Ray-native distributed dataset library that serves as the standard way to load, process, and exchange data in Ray libraries and applications. It offers performant distributed data loading, flexible parallel compute operations, broad datasource compatibility, and integrations with distributed frameworks, all behind a simple API.

Register for the webinar and get a first-hand look at how Ray Datasets:

  • Provides hyper-scalable parallel I/O to the most popular storage backends and file formats

  • Supports common last-mile preprocessing operations, including basic parallel data transformations such as map, batched map, and filter, as well as global operations such as sort, shuffle, groupby, and stats aggregations

  • Efficiently integrates with data processing libraries (e.g., Spark, Pandas, NumPy, Dask, Mars) and machine learning frameworks (e.g., TensorFlow, Torch, Horovod)

Speakers

Clark Zinzow

Software Engineer, Anyscale

Clark Zinzow is a software engineer at Anyscale. He loves working at the boundaries of service, data, and compute scale. When he's not reading papers or trapped under a stack of books, you can probably find him biking, skiing, or trapped under a different stack of books.

Alex Wu

Software Engineer, Anyscale

Alex is a software engineer at Anyscale and a Ray committer. His main contributions are to Ray's autoscaler and to the stability, scalability, and performance of Ray Core.