Distributed data loading and compute

ML and the need for distributed dataset abstraction

Challenge

Data loading and preprocessing pipelines can be slow, resource-inefficient, and hard to scale, which hurts both model training and batch inference. In training, trainers that are bottlenecked on data deliver poor throughput and leave GPUs underutilized. Batch inference suffers from the same low throughput and low GPU utilization.

Solutions

Ray Datasets, built atop Ray and Anyscale, leverages Ray’s task, actor, and object APIs to enable large-scale machine learning (ML) ingest, training, and inference, all within a single Python application. Ray Datasets:

  • Simplifies parallel and pipelined data processing

  • Provides a higher-level API

  • Internally handles data batching, task parallelism and pipelining, and memory management
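To make "data batching" and "task parallelism" concrete, here is a minimal plain-Python sketch of the kind of work a dataset library can hide behind a higher-level API. It uses only the standard library, not the Ray Datasets API; the helper names (`make_batches`, `preprocess`) and the pixel-scaling step are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def preprocess(batch):
    """Hypothetical preprocessing step: scale 8-bit pixel values to [0, 1]."""
    return [x / 255 for x in batch]


# Toy input: six 8-bit pixel values (0, 51, 102, 153, 204, 255).
pixels = list(range(0, 256, 51))
batches = make_batches(pixels, batch_size=4)

# Run the preprocessing step on several batches concurrently.
# pool.map returns results in the same order as the input batches.
with ThreadPoolExecutor(max_workers=2) as pool:
    processed = list(pool.map(preprocess, batches))
```

A distributed library performs the same split-then-process pattern across many machines and additionally manages memory and scheduling, which this single-process sketch does not attempt.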

Efficient and scalable data loading and preprocessing

Last-mile preprocessing

Ray Datasets supports the data preprocessing transformations commonly performed just before model training and model inference, which we refer to as last-mile preprocessing. These transformations are carried out via a few key operations: mapping, group-bys with aggregations, and random shuffling.
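The three operations named above can be sketched on a toy in-memory dataset. This is a plain-Python illustration of what each operation does to the records, not the Ray Datasets API; the record fields (`user`, `clicks`) and the `add_is_active` helper are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy records; the field names are illustrative only.
records = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
    {"user": "a", "clicks": 7},
    {"user": "c", "clicks": 1},
]


# 1. Mapping: apply a per-record transformation (here, add a derived feature).
def add_is_active(record):
    return {**record, "is_active": record["clicks"] >= 3}


mapped = [add_is_active(r) for r in records]

# 2. Group-by with aggregation: total clicks per user.
totals = defaultdict(int)
for r in mapped:
    totals[r["user"]] += r["clicks"]
clicks_per_user = dict(totals)

# 3. Random shuffle: reorder records so that training batches are not
#    correlated with storage order. A seeded generator keeps the sketch
#    reproducible.
shuffled = list(mapped)
random.Random(0).shuffle(shuffled)
```

On a real dataset these steps run in parallel over partitions of the data rather than over a single Python list.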

Scalable parallel I/O

Ray Datasets aims to be a universal parallel data loader, data writer, and exchange format, providing a narrow data waist for Ray applications and libraries with which to interface.

Data format compatibility

Using Arrow’s I/O layer means support for many tabular file formats (e.g., JSON, CSV, Parquet) and storage backends (e.g., local disk, S3, GCS, Azure Blob Storage, HDFS). Beyond tabular data, Datasets also supports parallel reads and writes of NumPy, text, and binary files.

Data framework compatibility

Ray Datasets allows for bidirectional in-memory data exchange with many popular distributed frameworks when they run on Ray, such as Spark, Dask, Modin, and Mars, as well as with pandas and NumPy for small local in-memory data. It also provides an exchange API for both PyTorch and TensorFlow.

Stateful GPU tasks

Ray Datasets supports running stateful computations on GPUs for batch inference on large datasets.
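A "stateful" computation here means that expensive setup, such as loading a model, happens once per worker and is then reused for every batch. The plain-Python sketch below shows that pattern with a callable class. It is not the Ray Datasets API and does not touch a GPU; the `BatchPredictor` class, its toy linear weights, and the `iter_batches` helper are assumptions for illustration.

```python
class BatchPredictor:
    """Callable that pays the model-loading cost once, then serves many batches."""

    loads = 0  # counts how many times the (pretend) model was loaded

    def __init__(self):
        # Stand-in for an expensive step such as loading weights onto a GPU.
        BatchPredictor.loads += 1
        self.weights = [0.5, -1.0]  # hypothetical linear model

    def __call__(self, batch):
        # Score every row in the batch using the already-loaded weights.
        return [sum(w * x for w, x in zip(self.weights, row)) for row in batch]


def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]


rows = [[1.0, 2.0], [3.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 1.0]]

predictor = BatchPredictor()  # state is created once
predictions = []
for batch in iter_batches(rows, batch_size=2):
    predictions.extend(predictor(batch))  # state is reused for each batch
```

The design point is that the model is loaded one time regardless of how many batches are processed; a stateless function would have to reload it on every call.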

Iterate and move to production fast with Ray Serve and Anyscale

Already using open source Ray?

Get started by moving your existing workloads to Anyscale with no code changes. Experience the magic of infinite scale at your fingertips.