Vision-Language-Action (VLA) models have emerged at the center of modern robotics and embodied AI. They combine perception (vision), reasoning (language), and control (action) into a single system that interacts with the physical world, trained on large multimodal datasets.
To remain competitive, robotics teams are shifting away from training classic vision models for perception and motion planning. Instead, they are fine-tuning VLA models on their proprietary data, their hardware, or both.
To stay ahead, teams have to move fast: curating data, iterating on models, and running more experiments without letting compute costs spiral out of control. As experimentation moves to larger datasets and larger models, single-node workflows and custom scripts for ad hoc parallelism quickly hit their limits, creating the need for a framework and platform that can scale data pipelines, training, and evaluation together without slowing developers down.
Ray has emerged as the de facto open-source framework for scaling robotics and embodied AI workloads, where distributed data preparation, training, simulation, and evaluation must run in parallel across large GPU clusters.
Leading robotics teams, including NVIDIA Isaac GROOT, The Robotics and AI Institute, and Physical Intelligence, use Ray to scale VLA training, simulation rollouts, and data pipelines.
“We facilitate fault-tolerant multi-node training and data ingestion via a custom library built on top of the Ray distributed computing library” - NVIDIA GROOT 1.6
If you’re still evaluating how to scale VLA pipelines, the sections below outline the key technical and business considerations and how Ray on Anyscale reduces both developer friction and operational risk.
For many robotics teams, VLAs represent a shift away from classic vision-only models and modular planning stacks. To remain competitive, teams now fine-tune large open VLA models or, in some cases, train task-specific architectures.
This shift changes the scale and shape of the AI workload:
Models get larger. VLA policies often use transformer backbones similar to large language and vision models, so memory pressure and training time increase quickly.
Data becomes multimodal and temporal. Datasets expand from images to video, trajectories, proprioception, and language annotations. Data preprocessing and loading become distributed problems (see the sketch after this list).
Training loops integrate simulation and inference. Rollouts, evaluation, and batch inference increasingly run alongside training, especially for reinforcement learning and imitation learning.
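As a rough illustration of what distributed preprocessing looks like with Ray Data, the sketch below transforms a multimodal episode dataset in parallel across a cluster. The bucket paths, column names, and preprocessing steps are hypothetical placeholders, not a prescribed schema.

```python
import numpy as np
import ray

# Hypothetical preprocessing step: normalize proprioception signals and
# lowercase the language instruction for each batch of episodes.
def preprocess(batch: dict) -> dict:
    proprio = batch["proprio"]
    batch["proprio"] = (proprio - proprio.mean()) / (proprio.std() + 1e-6)
    batch["instruction"] = np.array([s.lower() for s in batch["instruction"]])
    return batch

# Read raw episode records (placeholder path) and transform them in parallel;
# Ray Data handles sharding, scheduling, and data movement across nodes.
ds = ray.data.read_parquet("s3://my-bucket/vla-episodes/")
ds = ds.map_batches(preprocess, batch_size=64)

# Write the processed dataset back out, or stream it into a training job.
ds.write_parquet("s3://my-bucket/vla-episodes-processed/")
```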
All of this makes distributed processing a must-have, not a nice-to-have, for maintaining experimentation velocity. Scaling from a single GPU to multiple GPUs on one machine is usually manageable, but most VLA pipelines quickly exceed the capacity of a single node and move to multi-node execution that also combines CPU-based steps.
Once pipelines span multiple nodes, new failure modes appear: GPU or node failures cause partial job failures, dependencies drift across machines, and logs and metrics become fragmented. The setup effort quickly grows into undifferentiated infrastructure work such as provisioning and monitoring multi-node GPU clusters, managing data sharding and placement, building fault-tolerant checkpointing, and more.
The cost of a DIY approach is not just engineering time. It shows up as slower iteration, fragile experiments, missed deadlines, and models that never reach production.
VLA development rarely involves training in isolation. It typically happens inside a broader robotics ML platform that includes simulation, rollout generation, and evaluation, whether for supervised learning, offline RL, or online reinforcement learning.
Simulation plays several roles. It generates synthetic data, evaluates policies at scale, and stress-tests models across tasks and environments. In reinforcement learning, simulation is often tightly coupled with training, with rollout workers producing experience in parallel as policies are updated.
As VLA models grow, simulation and sim-eval pipelines must scale alongside training. Single-process simulators or locally parallel rollouts quickly become bottlenecks. Teams need to scale thousands of concurrent environment instances, evaluators, or rollout workers, often independently of the training job.
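A minimal sketch of this pattern with Ray actors is shown below: each actor owns one environment instance and produces rollouts independently of the training job. The RolloutWorker class, its methods, and the environment names are hypothetical stand-ins for a real simulator integration.

```python
import ray

ray.init()

# Hypothetical simulator wrapper: each actor holds one environment instance.
@ray.remote
class RolloutWorker:
    def __init__(self, env_name: str):
        self.env_name = env_name  # a real worker would construct the simulator here

    def rollout(self, policy_weights, num_steps: int):
        # Placeholder: step the environment with the current policy and return
        # a trajectory of observations, actions, and rewards.
        return {"env": self.env_name, "steps": num_steps}

# A handful of workers locally; on a cluster this becomes hundreds or thousands.
workers = [RolloutWorker.remote(f"kitchen-task-{i}") for i in range(8)]

# Generate rollouts in parallel; the trainer can consume them as they complete.
futures = [w.rollout.remote(policy_weights=None, num_steps=200) for w in workers]
trajectories = ray.get(futures)
```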
Simulation and rollout generation also have different hardware requirements than training. Some simulators are CPU-bound, while others use GPUs that differ from those needed for training. Lightweight simulation may run efficiently on RTX-class GPUs, while VLA training requires high-end accelerators such as NVIDIA H100s. Treating these workloads as interchangeable leads to poor utilization and unnecessary cost.
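One way to express this separation in Ray is through per-actor resource requests, so that each workload lands on the hardware it actually needs. The class names, CPU counts, and GPU fractions below are illustrative assumptions, not recommendations.

```python
import ray

# CPU-bound simulation: request CPUs only, so these actors are scheduled on
# CPU-heavy nodes and never occupy training GPUs.
@ray.remote(num_cpus=4)
class CpuSimWorker:
    def step(self):
        ...

# Lightweight GPU simulation or rendering: request a fraction of a lower-cost
# GPU so several workers can share one device.
@ray.remote(num_gpus=0.25)
class GpuSimWorker:
    def render(self):
        ...

# Training, by contrast, would request whole high-end accelerators, for example
# through Ray Train's ScalingConfig(use_gpu=True) on H100 nodes.
```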
Running all workloads on a single homogeneous platform wastes resources, especially when GPUs are scarce and expensive. VLA pipelines require independent scaling, hardware-aware scheduling, and coordination across the full system. At this point, infrastructure decisions matter as much as model architecture.
When VLA pipelines scale end to end, point solutions break down. Training, simulation, rollout generation, evaluation, and data processing must all scale together, often on different hardware with different performance profiles.
Ray provides a unified distributed execution framework that matches how robotics and VLA systems are built. Instead of stitching together separate systems, Ray offers a single programming model that expresses training, simulation, and data processing within one coordinated system.
Ray handles the core distributed systems challenges VLA pipelines face:
Coordinating large numbers of concurrent workers
Scheduling across heterogeneous CPUs and GPUs
Managing data movement across nodes
Providing fault tolerance for long-running jobs
Allowing components to scale independently
This makes Ray a natural fit for robotics workloads. Simulation environments map cleanly to Ray actors. Rollout workers, evaluators, and data preprocessors can run as parallel tasks. Training jobs scale using Ray Train, allowing teams to distribute VLA fine-tuning or training. Data pipelines scale using Ray Data.
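As a minimal sketch of how these pieces fit together, the skeleton below wires a Ray Data dataset into a Ray Train TorchTrainer. The dataset path, worker count, and the body of train_func are placeholders; a real loop would build the VLA model, wrap it with ray.train.torch.prepare_model, and iterate over its dataset shard.

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # Placeholder training loop: a real implementation would construct the VLA
    # model here and read batches via ray.train.get_dataset_shard("train").
    for epoch in range(config["epochs"]):
        ray.train.report({"epoch": epoch, "loss": 0.0})  # dummy metrics

# Hypothetical preprocessed dataset from the earlier Ray Data step.
train_ds = ray.data.read_parquet("s3://my-bucket/vla-episodes-processed/")

trainer = TorchTrainer(
    train_func,
    train_loop_config={"epochs": 3},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # 8 GPU workers, illustrative
    datasets={"train": train_ds},
)
result = trainer.fit()
```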
Ray also enables clean separation of workloads. Simulation and rollouts scale on CPU-heavy nodes or lower-cost GPUs, while training runs on high-end accelerators. Evaluation can scale independently, all within one execution framework.
Ray simplifies distributed execution, but operating large-scale VLA pipelines still carries significant overhead.
Teams must manage:
Cluster provisioning and teardown
Environment consistency across nodes
Node failures and preemptions
Centralized observability
Cost control as GPU usage grows
In robotics, where experiments can run for days and simulation executes continuously, these issues directly impact iteration speed and cost.
Ray on Anyscale builds on open-source Ray with a managed platform designed for large, heterogeneous AI workloads.
With Anyscale, teams get:
Purpose-built developer tooling. Accelerate experimentation with integrated development environments that scale elastically, automate dependency management, and extend the same configs to production clusters.
Managed, heterogeneous clusters. Automatically provision and scale clusters with different node types: CPU-heavy nodes for simulation, GPU nodes for training, and burst capacity for evaluation.
Multi-cloud orchestration. Run and manage Ray workloads across multiple clouds, enabling flexible workload placement based on GPU availability and cost, with reliable execution on reserved, on-demand, and spot instances.
Production-grade fault tolerance. Unhealthy nodes are detected and replaced automatically, allowing long-running VLA training and simulation pipelines to recover without manual intervention.
Unified observability across workloads. Centralized logs, metrics, and job state across training, simulation, and evaluation, without SSH’ing into individual nodes.
Built-in cost and resource governance. Visibility into GPU utilization and workload-level resource consumption, helping teams avoid over-provisioning scarce accelerators.
With Ray on Anyscale, scaling VLA pipelines stops feeling like maintaining fragile infrastructure. Teams focus on models, data, and learning algorithms while the platform handles reliability and resource management.
Vision-Language-Action models fundamentally change the scale of robotics ML workloads
VLA pipelines require distributed data processing, distributed training, distributed simulation, and multi-GPU execution
Ray provides a unified execution model for training, simulation, rollout generation, and data processing
Anyscale turns Ray into a production-ready platform for robotics-scale experimentation
In short, Ray provides the distributed compute framework VLA pipelines need. Anyscale turns it into a resilient, cost-efficient platform for robotics-scale experimentation.
Join us at our upcoming webinar: Accelerating physical AI with VLAs
Case study: Physical Intelligence Builds Adaptable Robot Intelligence with Anyscale
Video: Hybrid RL + Imitation Learning for Robotics with Ray at The Robotics and AI Institute