Data loading and preprocessing pipelines can be slow, resource-inefficient, and hard to scale. Both model training and batch inference suffer as a result: a bottlenecked data pipeline starves trainers, leading to poor training throughput and low GPU utilization, and it likewise drags down batch inference throughput while leaving GPUs underutilized.
Ray Datasets, built atop Ray and Anyscale, leverages Ray’s task, actor, and object APIs to enable large-scale machine learning (ML) ingest, training, and inference, all within a single Python application. In particular, Ray Datasets provides the following capabilities.
Ray Datasets supports data preprocessing transformations commonly performed just before model training and model inference, which we refer to as last-mile preprocessing. These transformations are carried out via a few key operations: mapping, groupbys + aggregations, and random shuffling.
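As a rough sketch, the snippet below shows what these last-mile operations look like with the Ray Datasets API; the S3 path and the column names ("value", "category") are made-up placeholders for illustration:

```python
import ray

# Load a tabular dataset (the bucket path and columns are placeholders).
ds = ray.data.read_parquet("s3://my-bucket/train-data/")

# Mapping: apply a vectorized transformation to each batch of records.
def normalize(batch):
    batch["value"] = batch["value"] / batch["value"].max()
    return batch

ds = ds.map_batches(normalize, batch_format="pandas")

# Groupby + aggregation: compute a per-group mean of a column.
means = ds.groupby("category").mean("value")

# Random shuffling: globally shuffle rows before feeding them to a trainer.
ds = ds.random_shuffle()
```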
Ray Datasets aims to be a universal parallel data loader, data writer, and exchange format, providing a narrow data waist for Ray applications and libraries to interface with.
Using Arrow’s I/O layer means support for many tabular file formats (e.g., JSON, CSV, Parquet) and storage backends (e.g., local disk, S3, GCS, Azure Blob Storage, HDFS). Beyond tabular data, Datasets also supports parallel reads and writes of NumPy, text, and binary files.
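A brief sketch of that read/write surface, assuming the bucket paths below exist and the cluster has credentials for them:

```python
import ray

# Tabular reads from cloud storage (bucket names are placeholders; the
# storage backend is resolved through Arrow's filesystem layer).
csv_ds = ray.data.read_csv("s3://my-bucket/raw-csv/")
json_ds = ray.data.read_json("s3://my-bucket/raw-json/")
parquet_ds = ray.data.read_parquet("s3://my-bucket/events/")

# Non-tabular reads.
text_ds = ray.data.read_text("s3://my-bucket/logs/")
numpy_ds = ray.data.read_numpy("s3://my-bucket/arrays/")
binary_ds = ray.data.read_binary_files("s3://my-bucket/images/")

# Parallel writes back out to storage.
parquet_ds.write_parquet("s3://my-bucket/processed/")
```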
Ray Datasets allows for bidirectional in-memory data exchange with many popular distributed frameworks when they run on Ray, such as Spark, Dask, Modin, and Mars, as well as Pandas and NumPy for small local in-memory data. It also provides an exchange API for both PyTorch and TensorFlow.
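For example, a Dataset can be built from and converted back to local in-memory data, or handed off to a training framework. The toy DataFrame below is just an illustration, and the PyTorch exchange method differs across Ray versions:

```python
import pandas as pd
import ray

# Local pandas DataFrame -> Ray Dataset and back.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
ds = ray.data.from_pandas(df)
df_again = ds.to_pandas()

# With Dask-on-Ray, a Dask DataFrame can be exchanged in memory:
#   ds = ray.data.from_dask(dask_df)
#   dask_df = ds.to_dask()

# Exchange with PyTorch for training; the method name varies by Ray
# version (to_torch in earlier releases, iter_torch_batches in later ones).
torch_dataset = ds.to_torch(label_column="y", batch_size=2)
```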
Ray Datasets supports running stateful computations on GPUs for batch inference on large datasets.
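A minimal sketch of stateful GPU batch inference, assuming a hypothetical load_model helper and placeholder S3 paths; the exact compute/ActorPoolStrategy arguments vary across Ray versions:

```python
import ray

# A callable class holds state (e.g., a model) that is loaded once per
# actor and reused across batches.
class BatchPredictor:
    def __init__(self):
        self.model = load_model()  # hypothetical: load the model onto the GPU

    def __call__(self, batch):
        batch["prediction"] = self.model.predict(batch)  # hypothetical model API
        return batch

ds = ray.data.read_parquet("s3://my-bucket/inference-inputs/")

# Map the callable class over the dataset with a pool of GPU actors,
# so each actor keeps its model in memory between batches.
predictions = ds.map_batches(
    BatchPredictor,
    compute=ray.data.ActorPoolStrategy(min_size=1, max_size=4),
    num_gpus=1,      # each actor is scheduled on one GPU
    batch_size=256,
)
predictions.write_parquet("s3://my-bucket/inference-outputs/")
```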