Webinar

Ray Datasets: Scalable data preprocessing for distributed ML

Wednesday, February 23, 5:00PM UTC

Ray Datasets is a Ray-native distributed dataset library that serves as the standard way to load, process, and exchange data in Ray libraries and applications. It offers performant distributed data loading, flexible parallel compute operations, broad datasource compatibility, and integrations with distributed frameworks, all behind a simple API.

Register for the webinar and get a first-hand look at how Ray Datasets:

  • Provides hyper-scalable parallel I/O to the most popular storage backends and file formats

  • Supports common last-mile preprocessing operations, including basic parallel data transformations such as map, batched map, and filter, as well as global operations such as sort, shuffle, groupby, and stats aggregations

  • Efficiently integrates with data processing libraries (e.g., Spark, Pandas, NumPy, Dask, Mars) and machine learning frameworks (e.g., TensorFlow, Torch, Horovod)

Speakers

Clark Zinzow

Software Engineer, Anyscale

Clark Zinzow is a software engineer at Anyscale. He loves working at the boundaries of service, data, and compute scale. When he's not reading papers or trapped under a stack of books, you can probably find him biking, skiing, or trapped under a different stack of books.

Alex Wu

Software Engineer, Anyscale

Alex is a software engineer at Anyscale and a Ray committer. His main contributions are to Ray's autoscaler and to the stability, scalability, and performance of Ray Core.