In today's AI landscape, your ability to process data at scale isn't just a technical challenge; it's a competitive differentiator. Whether your teams are running batch inference, generating embeddings, or computing statistics across massive unstructured datasets, every hour of wasted compute directly impacts both your bottom line and your time-to-market.
While open source Ray Data provides a solid foundation for distributing Python-based data processing jobs, large-scale production AI workloads demand greater execution efficiency and reliability, especially when you're competing in a rapidly evolving market.
Today, we're excited to announce three significant improvements to Anyscale’s proprietary RayTurbo Data. These improvements transform how your teams work with large-scale data, dramatically reducing both processing times and operational risks with:
Job-level checkpointing to easily resume interrupted batch inference pipelines
Vectorized aggregations to speed up computing statistics across large datasets
Intelligent operator reordering with a focus on filter and projection operations
Combined, these enhancements can deliver up to a 5x speedup to your large-scale pipelines when comparing open source Ray Data against RayTurbo Data on Anyscale, based on our benchmarks using the TPC-H Orders dataset.
In production environments, reliability isn't optional; it's essential. Failed jobs are not only expensive in compute, they also cost your team hours of development time, stalling progress and eroding your competitive advantage.
That's why we're introducing Job-level Checkpointing, a proprietary feature available only on the Anyscale platform. With built-in state persistence for Ray tasks and actors, your inference workloads can now resume exactly where they left off after manual or automatic cluster shutdowns, avoiding wasted compute and keeping tight delivery schedules on track.
While Ray Data already retries individual tasks when worker nodes fail, more significant failures – like head node crashes, OOMs, or cluster terminations – previously meant restarting the entire job from scratch. For long-running batch inference jobs processing millions of records, this restart penalty could mean hours or days of lost work.
With job-level checkpointing in RayTurbo, your pipelines can seamlessly resume where they left off:
Execution progress is continuously tracked and persisted at the row level
Upon restart, Ray Data can be configured to read from the prior state
Processing resumes precisely where it left off, touching only the remaining rows
This creates a more predictable execution pattern for large-scale AI workloads. Instead of worrying about potential failure points, you can focus on the job's output.
RayTurbo Data now persists row-level execution state to durable storage. Currently, job-level checkpointing supports pipelines that read from a source, apply row-preserving operators (map, map_batches, filter, flat_map), and write to a sink. Fan-out or aggregation operators will be supported in an upcoming release.
The implementation is straightforward:
First, configure checkpointing with a column containing unique row IDs
During execution, each completed row's ID is persisted to the sink
Upon restart, Ray reconstructs the execution graph, identifies completed rows, and processes only the remaining items
Checkpoints can be written to any object store or distributed file system accessible to your cluster: S3, GCS, Azure Blob, HDFS, or an NFS share.
To implement this feature, ensure you're using Ray 2.43+ on Anyscale, then configure your checkpoint settings:
from ray.data import DataContext
from ray.anyscale.data.checkpoint import CheckpointConfig, CheckpointBackend

ctx = DataContext.get_current()
ctx.checkpoint_config = CheckpointConfig(
    id_column="id",  # Unique row identifier
    checkpoint_path="s3://my_bucket/ckpt/my_inference_job",  # Durable path
)
Then build your pipeline as usual:
import ray

ds = ray.data.read_parquet("s3://inference_input/")

# Any map-family transformations (run_llm is your inference function)
results = (
    ds
    .map_batches(run_llm)
    .filter(lambda r: r["score"] > 0.5)
)

results.write_parquet("s3://inference_output/")
If your cluster is interrupted halfway through processing, simply rerun the same script on a new cluster – RayTurbo Data will automatically identify completed rows and resume processing from where it left off. This practical resilience means your batch jobs can now handle interruptions without requiring time-consuming full restarts.
Data aggregations are a common operation in AI pipelines, powering key steps such as normalization and one-hot encoding.
RayTurbo Data now includes fully vectorized aggregations that move computation from Python to optimized native code. This approach eliminates the performance penalties of Python's interpreter while maximizing throughput on modern CPU architectures.
Building on the foundation of the new hash-based shuffle, we've enhanced RayTurbo Data's aggregation capabilities with:
Native vectorized execution of aggregation operations
Optimized memory usage during group-by operations
These optimizations are particularly valuable for feature engineering, data summarization, and analytics workloads where you're frequently computing statistics across large datasets. The performance improvements scale with dataset size, providing the most benefit for your largest, most complex aggregation tasks.
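For example, consider computing grouped statistics over a feature table. The following is a minimal sketch (the dataset path and the category/value column names are hypothetical) of the kind of aggregation these optimizations accelerate, with no code changes required:

import ray
from ray.data.aggregate import Mean, Std

# Hypothetical feature table with a categorical key and a numeric column.
ds = ray.data.read_parquet("s3://my_bucket/features/")

# Standard group-by aggregations; in RayTurbo Data these execute as
# vectorized native kernels rather than per-row Python code.
stats = ds.groupby("category").aggregate(
    Mean("value"),
    Std("value"),
)
stats.show()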
Processing data efficiently isn't just about execution speed – it's also about minimizing unnecessary work. RayTurbo Data now includes enhanced optimizer rules that automatically reorder operations in your data pipelines for better performance.
These optimizer rules focus primarily on filter and projection operations, intelligently reordering them based on principles from relational algebra. By pushing filters earlier in the execution plan and optimizing column selection, your pipelines process less data and complete faster.
These optimizations happen automatically without requiring any changes to your code. You can continue writing pipelines in the most natural, readable way while the optimizer handles performance tuning behind the scenes.
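As a sketch of the kind of pipeline that benefits, consider the following. Here featurize, the dataset path, and the column names are hypothetical, and exactly which reorderings apply depends on what the optimizer can infer about each operation. The filter and projection are written after the transformation, but they are candidates to be moved earlier in the plan so the expensive step sees less data:

import ray

def featurize(batch):
    # Hypothetical per-batch transformation; batches arrive as dicts of
    # NumPy arrays by default. It adds a column and passes the rest through.
    batch["feature"] = batch["value"] * 2
    return batch

ds = ray.data.read_parquet("s3://my_bucket/events/")

# Written in the natural, readable order: transform, then filter, then
# project. Because the filter only touches a column the transformation
# passes through unchanged, it can be pushed toward the scan.
result = (
    ds
    .map_batches(featurize)
    .filter(lambda row: row["country"] == "US")
    .select_columns(["user_id", "feature"])
)
result.write_parquet("s3://my_bucket/events_us/")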
We've discussed the technical improvements, but how much difference do they actually make for real workloads? To answer this question, we ran comprehensive benchmarks comparing RayTurbo Data against open source Ray Data.
Our test environment was a cluster with one m7i.4xlarge (16 CPU / 64 GB) head node and five m7i.16xlarge (64 CPU / 256 GB) worker nodes. Object store memory on each worker node was set to 128 GB, and num_cpus on the head node was set to 0 so that no tasks were scheduled on it.
We ran two workloads on the TPC-H Orders dataset at scale factors 100, 1000, and 10000 (corresponding to roughly 100M, 1B, and 10B elements):
TPC-H Q1: An aggregation-heavy query from the TPC-H benchmark.
Preprocessing with filters and projections: A couple of filters and column selections, followed by a chain of imputation and normalization operators (via the SimpleImputer and StandardScaler preprocessors); a sketch follows below.
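For reference, the preprocessing workload looks roughly like the following sketch. The exact predicates, column choices, and imputation strategy in our benchmark differ; the ones below are illustrative stand-ins on the TPC-H Orders schema:

import ray
from ray.data.preprocessors import SimpleImputer, StandardScaler

# Filters and column selections over the TPC-H Orders table
# (the predicate and selected columns here are illustrative).
ds = (
    ray.data.read_parquet("s3://my_bucket/tpch/orders/")
    .filter(lambda row: row["o_orderstatus"] == "F")
    .select_columns(["o_orderkey", "o_totalprice"])
)

# Chain of imputation and normalization via the named preprocessors.
ds = SimpleImputer(columns=["o_totalprice"], strategy="mean").fit_transform(ds)
ds = StandardScaler(columns=["o_totalprice"]).fit_transform(ds)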
In the figure below, we measure the runtime improvement between open source Ray Data and RayTurbo Data for TPC-H Q1. This test evaluates the performance benefit of vectorized aggregations. Compared to the open source implementation, RayTurbo improves job completion time by 1.6x to 2.6x.
Fig 1. Runtime improvement between open source Ray Data and RayTurbo Data for TPC-H Q1.
We next measure the runtime improvement between open source Ray Data and RayTurbo Data on the preprocessing workload with filters and column selections. This test primarily demonstrates the benefit of the improved optimizer rules. Compared to the open source implementation, RayTurbo improves job completion time by 3.3x to 4.9x.
Fig 2. Runtime improvement between open source Ray Data and RayTurbo Data on the preprocessing workload with filters and column selections.
Ready to see how these improvements can accelerate your AI workflows? Get started with RayTurbo Data on the Anyscale Platform today. Sign up now and receive $100 in free credits to experience the performance and reliability advantages for yourself.
Whether you're running batch inference, generating embeddings, or building feature pipelines, RayTurbo Data provides the foundation you need to process data at AI scale.