Data Ingestion

Efficient and scalable data loading and processing for your ML workloads.

Ray Data provides a Python-native API to ingest and process data at scale with built-in fault tolerance. It builds on the PyArrow data format and lets you read, transform, shuffle, and stream data for your ML workloads.

Common applications include:
  • Batch inference
  • Training data pre-processing
  • Unstructured data processing

Datasets for Parallel Compute

Datasets simplify general-purpose parallel GPU and CPU compute in Ray, for instance GPU batch inference. They provide a higher-level API over Ray tasks and actors for such embarrassingly parallel compute, internally handling operations like batching, pipelining, and memory management.

As part of the Ray ecosystem, Ray Datasets can leverage the full functionality of Ray’s distributed scheduler, e.g., using actors for optimizing setup time and GPU scheduling.
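As a rough illustration of the batch inference pattern described above, the sketch below passes a callable class to map_batches so that setup runs once per actor rather than once per batch. ScaleModel, the column name x, and run_batch_inference are hypothetical names chosen for this example. The compute=ray.data.ActorPoolStrategy(size=2) argument assumes a Ray 2.x release; newer releases may prefer a concurrency argument instead.

```python
from typing import Any, Dict, List


class ScaleModel:
    """Hypothetical stand-in for a real model (assumption, not a Ray class)."""

    def __init__(self, scale: float = 2.0) -> None:
        # Expensive one-time setup (for example, loading weights onto a GPU)
        # would go here. With an actor pool it runs once per actor.
        self.scale = scale

    def __call__(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        # With batch_format="numpy", batch["x"] is a NumPy array, so the
        # multiplication is applied element-wise to the whole batch.
        batch["prediction"] = batch["x"] * self.scale
        return batch


def run_batch_inference() -> List[Dict[str, Any]]:
    # Imported lazily so ScaleModel stays importable without Ray installed.
    import ray

    # A tiny in-memory dataset; real workloads would read files instead.
    ds = ray.data.from_items([{"x": float(i)} for i in range(8)])

    predictions = ds.map_batches(
        ScaleModel,
        batch_format="numpy",
        batch_size=4,
        # Assumption: actor-pool compute strategy as exposed in Ray 2.x.
        compute=ray.data.ActorPoolStrategy(size=2),
        # For GPU inference, num_gpus=1 would request one GPU per actor.
    )
    return predictions.take_all()
```

Calling run_batch_inference() on a machine with Ray installed should return one row per input item, each carrying the original x and the added prediction column.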

Ray Datasets are the standard way to load and exchange data in Ray libraries and applications. They provide basic distributed data transformations such as maps (map_batches), global and grouped aggregations (GroupedDataset), and shuffling operations (random_shuffle, sort, repartition), and are compatible with a variety of file formats, data sources, and distributed frameworks.
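The operations named above can be chained roughly as follows. This is a minimal sketch assuming a Ray 2.x release; add_squared, transform_demo, and the column names group and value are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List, Tuple


def add_squared(batch: Dict[str, Any]) -> Dict[str, Any]:
    # With batch_format="numpy", batch["value"] is a NumPy array, so the
    # power is computed element-wise for the whole batch.
    batch["squared"] = batch["value"] ** 2
    return batch


def transform_demo() -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    # Imported lazily so add_squared stays importable without Ray installed.
    import ray

    ds = ray.data.from_items(
        [{"group": i % 3, "value": i} for i in range(12)]
    )

    # Map: apply a function to each batch of rows.
    ds = ds.map_batches(add_squared, batch_format="numpy")

    # Grouped aggregation: count the rows in each group.
    counts = ds.groupby("group").count()

    # Shuffling operations: randomize row order, sort by a column,
    # then change the number of blocks.
    ds = ds.random_shuffle()
    ds = ds.sort("value")
    ds = ds.repartition(4)

    return counts.take_all(), ds.take(3)
```

Each call returns a new dataset, so the steps compose freely; take and take_all pull rows back to the driver for inspection.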

Here’s an overview of the integrations with other processing frameworks, file formats, and supported operations, as well as a glimpse at the Ray Datasets API.

Check our compatibility matrix to see if your favorite format is already supported.

[Figure: overview of Ray Datasets integrations, file formats, and supported operations]

Data Loading and Preprocessing for ML Training

Ray Datasets are designed to load and preprocess data for distributed ML training pipelines. Compared to other loading solutions, Datasets are more flexible (e.g., they can express higher-quality per-epoch global shuffles) and provide higher overall performance.
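One way a per-epoch global shuffle might look is sketched below: the full dataset is reshuffled before every epoch and then consumed in batches. It assumes a Ray 2.x release; collect_ids, shuffled_epochs, and the column name x are hypothetical, and the loop body stands in for a real training step.

```python
from typing import Any, Dict, Iterable, List


def collect_ids(batches: Iterable[Dict[str, Any]]) -> List[int]:
    # Flatten the "x" column of each batch into one list of ints, which
    # records the order in which rows were seen.
    return [int(v) for batch in batches for v in batch["x"]]


def shuffled_epochs(num_epochs: int = 2, batch_size: int = 4) -> List[List[int]]:
    # Imported lazily so collect_ids stays importable without Ray installed.
    import ray

    ds = ray.data.from_items([{"x": i} for i in range(16)])

    orders = []
    for epoch in range(num_epochs):
        # Global shuffle: every row can land anywhere in the new order.
        # A different seed per epoch gives a different order each time.
        shuffled = ds.random_shuffle(seed=epoch)

        # A real pipeline would run a training step per batch here.
        batches = shuffled.iter_batches(
            batch_size=batch_size, batch_format="numpy"
        )
        orders.append(collect_ids(batches))
    return orders
```

Each inner list returned by shuffled_epochs() holds the row order seen in one epoch, which makes it easy to check that the order changes between epochs.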

Ray Datasets are not intended as a replacement for more general data processing systems. Learn more about how Ray Datasets work with other ETL systems.

Why Ray Datasets?

Here are a few reasons to choose Ray Datasets for your large-scale data processing and transformation needs.

Built for scale

Run basic data operations such as map, filter, repartition, and shuffle on petabyte-scale data in native Python code.

Distributed Arrow

With a distributed Arrow backend, Ray Datasets work easily with a variety of file formats, data sources, and distributed frameworks.

Ray ecosystem

Load your data once and enjoy a pluggable experience. Once your data is in your Ray cluster with Datasets, leveraging Ray is a breeze.