ML library for distributed and unstructured data processing. Anyscale supports and further optimizes Ray Data for improved performance, reliability, and scale.
- Faster than AWS SageMaker for data preprocessing
- Faster than Apache Spark for unstructured data preprocessing
- Faster than open source Ray at read-intensive data workloads
- Cheaper than AWS Bedrock and OpenAI for LLM batch inference
Ray Data is a scalable data processing library for ML and AI workloads.
With flexible and performant APIs for distributed data processing, Ray Data enables offline batch inference and data preprocessing and ingest for ML training. Built on top of Ray Core, it scales effectively to large clusters and offers scheduling support for both CPU and GPU resources. Ray Data also uses streaming execution to efficiently process large datasets and maintain high GPU utilization.
Ray Data offers an efficient and scalable solution for batch inference, consistently outperforming competitors.
Leverage CPUs and GPUs in the same pipeline to increase GPU utilization and decrease costs.
Ray Data automatically recovers from out-of-memory failures and spot instance preemption.
Work with your favorite ML frameworks and libraries, just at scale. Ray Data supports any ML framework of your choice, from PyTorch to Hugging Face to TensorFlow and more.
Ray Data supports a wide variety of formats including Parquet, images, JSON, text, CSV, and more, as well as storage solutions like Databricks and Snowflake.
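As a sketch of how these pieces fit together (the dataset paths and the stand-in model are placeholder assumptions, not a definitive implementation), a Ray Data pipeline can mix CPU preprocessing stages with GPU inference stages:

```python
import numpy as np


def normalize(batch: dict) -> dict:
    """CPU stage: scale uint8 images to [0, 1] float32."""
    batch["image"] = batch["image"].astype("float32") / 255.0
    return batch


class Predictor:
    """GPU stage: a stateful class loads its model once per actor,
    so model weights are not reloaded for every batch."""

    def __init__(self):
        # Stand-in for a real model loader (e.g. a PyTorch checkpoint).
        self.model = lambda imgs: imgs.mean(axis=(1, 2))

    def __call__(self, batch: dict) -> dict:
        batch["pred"] = self.model(batch["image"])
        return batch


def run_pipeline():
    """Hypothetical end-to-end batch inference job on a Ray cluster."""
    import ray

    ds = ray.data.read_parquet("s3://my-bucket/images/")  # placeholder path
    ds = ds.map_batches(normalize)  # scales out as CPU tasks
    # One GPU per actor; Ray Data streams batches through both stages.
    ds = ds.map_batches(Predictor, concurrency=4, num_gpus=1)
    ds.write_parquet("s3://my-bucket/predictions/")  # placeholder path
```

Because execution is streaming, the CPU stage keeps the GPU actors fed instead of materializing the whole dataset first.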
| Capability | AWS SageMaker | Apache Spark | Ray Data (open source) | Ray Data on Anyscale |
|---|---|---|---|---|
| Text Support | Yes | Yes | Yes | Yes |
| Image Support | No | Yes | Yes | Yes |
| Audio Support | Manual | No | Manual | Yes |
| Video Support | Manual | No | Manual | Yes |
| Video Support | No | No | Binary | Binary |
| Task-Specific CPU & GPU Allocation | No | No | Yes | Yes |
| Stateful Tasks | No | No | Yes | Yes |
| Native NumPy Support | No | No | Yes | Yes |
| Native Pandas Support | No | Yes | Yes | Yes |
| Model Parallelism Support | No | No | Yes | Yes |
| Nested Task Parallelism | No | No | Yes | Yes |
| Fast Node Launching and Autoscaling | No | No | No | 60 sec |
| Fractional GPU Support | No | Limited | Yes | Yes |
| Load Datasets Larger Than Cluster Memory | No | No | Yes | Yes |
| Improved Observability | No | No | No | Yes |
| Autoscale Workers to Zero | No | Limited | Yes | Yes |
| Job Queues | No | No | No | Yes |
| Priority Scheduling | No | No | No | Yes |
| Accelerated Execution | No | No | No | Yes |
Data Loading / Data Ingest / Last Mile Preprocessing |
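A minimal sketch of what last-mile preprocessing and streaming training ingest can look like (the path, the one-hot label encoding, and the batch size are illustrative assumptions, not Anyscale template code):

```python
import numpy as np


def add_label_onehot(batch: dict, num_classes: int = 10) -> dict:
    """Last-mile CPU preprocessing: one-hot encode integer labels."""
    batch["label_onehot"] = np.eye(num_classes, dtype="float32")[batch["label"]]
    return batch


def train_ingest():
    """Hypothetical ingest loop; assumes a running Ray cluster."""
    import ray

    ds = ray.data.read_parquet("s3://my-bucket/train/")  # placeholder path
    ds = ds.map_batches(add_label_onehot)
    # Streaming iteration: batches are yielded as upstream stages produce
    # them, so the dataset never has to fit in cluster memory at once.
    for batch in ds.iter_batches(batch_size=256):
        pass  # feed the batch to your training step here
```

The same pattern feeds Ray Train workers, with Ray Data handling sharding and shuffling across the cluster.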
Jumpstart your development process with custom-made templates, only available on Anyscale.
- Run LLM offline inference on large-scale input data with Ray Data.
- Compute text embeddings with Ray Data and Hugging Face models.
- Pre-train a Stable Diffusion V2 model with Ray Train and Ray Data.
A variety of Anyscale customers and open source Ray users rely on the Ray Data library to build advanced AI applications. A few exciting examples include:
Companies Using Ray Data for Offline Batch Inference:
Companies Using Ray Data for ML Training Ingest:
Get up to 90% cost reduction on unstructured data processing with Anyscale, the smartest place to run Ray.