
Scale Robot Policy Evaluation with Ray

By Ian Jordan, PhD, Alicia Chua, Artur Niederfahrenhorst and Omar Shorbaji | June 26, 2026

Evaluating a robot foundation model means running heavy GPU inference in a closed loop along with a heavy GPU physics simulator, thousands of times over. This post shows how to disaggregate those two workloads with Ray and Isaac Lab, and scale each independently on Anyscale.

Introduction

Policy evaluation is an essential part of robotics. Real-robot evaluation is slow, expensive, and unsafe. Simulation, in contrast, is safe, repeatable, parallelizable, and cheap per trial. So evaluation in simulation (sim-eval) is an important tool.

However, while simulation removes the bottlenecks of physical evaluation, it replaces them with a distributed-systems problem. Evaluating robot foundation models means running GPU-heavy simulation and GPU-heavy policy inference together across hundreds or thousands of rollouts. At scale, this spans multiple GPUs and nodes.

Three challenges emerge from this:

It is a closed loop, not a batch pass. Policy inference and simulation alternate, step by step, with the output of one feeding the input of the other. You cannot decouple them into independent offline stages.
Two GPU-bound workloads cannot share a process or a GPU. A 3B vision-language-action (VLA) model wants a GPU, and Isaac Sim's physics and rendering want a GPU. Co-locate them and they contend for memory and compute, stalling each other. Put them in the same process and their runtimes conflict outright.
Hundreds of rollouts should share one policy, not reload it. Each rollout needs its own simulator, but all of them can be served by a small fleet of policy replicas. Bundle the policy into every simulator instead, and scaling rollouts means standing up N copies of a 3B+ model, which is wasteful and often infeasible on GPU memory.

Ray and Anyscale resolve these three difficulties directly.

Closed-loop execution: Ray tasks and actors run simulators as isolated processes across the cluster. Each simulator advances the environment step by step, calls the policy service for an action, applies that action, and repeats. This keeps evaluation online and closed-loop rather than forcing it into an offline batch pipeline.
Separate GPU-bound workloads: Ray Serve runs the policy as an autoscaling service, the same primitive used to serve LLMs in production. Isaac Lab simulators run separately, each in its own process with its own GPU allocation. The policy runtime and simulation runtime do not have to share a process or fight over the same GPU.
Shared policy replicas: Hundreds of simulators can query a smaller fleet of shared policy replicas instead of each rollout loading its own copy of the 3B VLA. You can scale policy replicas when inference is the bottleneck, and scale simulators when rollout throughput is the bottleneck.

Anyscale packages this heterogeneous, notoriously brittle robotics stack into one reproducible cluster image and runs it on a managed Ray cluster. That means you can move from a single rollout to hundreds of parallel rollouts without rebuilding the infrastructure from scratch.

In this post, we deploy NVIDIA’s GR00T-N1.7-3B VLA with Ray Serve, use it to drive a Unitree G1 humanoid in an Isaac Lab pick-and-place scene, and fan the closed-loop rollout out across the cluster.

The Architecture

The policy and the simulator live on separate GPU workers of the same Anyscale Ray cluster and communicate over HTTP. In our example we demonstrate this with a Jupyter Notebook. A driver on the head node deploys the policy with Ray Serve, then launches simulator workers as Ray tasks. Each worker drives an Isaac Lab environment and queries the policy endpoint once per action chunk.

import os
import ray
from ray import serve
from policy_server import GR00TPolicyServer

# Stage 0: connect to the cluster
ray.init(
    address="auto",
    runtime_env={"env_vars": {"HF_TOKEN": os.environ["HF_TOKEN"]}},
)

# Stage 1: serve the policy as an HTTP service (GPU)
deployment = GR00TPolicyServer.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1},
).bind(model_path="nvidia/GR00T-N1.7-3B", embodiment_tag="REAL_G1")

serve.run(deployment, name="gr00t-policy")

# Stage 2: fan out simulators that query the shared policy (GPU)
results = ray.get([run_sim_rollout.remote(POLICY_URL) for _ in range(100)])

The rest of this post unpacks each stage and the problem it solves.

Part 1: Closed-loop evaluation needs online policy serving

The problem: Policy inference must turn observations (camera frames, joint state, a language instruction) into actions with low enough latency that a closed-loop rollout finishes in reasonable time. And because hundreds of rollouts will run concurrently, inference becomes an N-by-M serving problem: many simulator workers should query a smaller fleet of shared policy replicas, with Ray Serve routing requests across them, rather than each simulator loading its own copy of the model.

The Ray solution: Wrap the policy in a Ray Serve deployment. GR00TPolicyServer is a plain Python class decorated with @serve.deployment and @serve.ingress(app), where app is a FastAPI instance. It loads GR00T-N1.7-3B onto a GPU once and exposes POST /predict and GET /stats. The listing shows only the /predict route.

from fastapi import FastAPI, Request, Response
from ray import serve
import pickle, torch

app = FastAPI()

@serve.deployment(ray_actor_options={"num_gpus": 1}, max_ongoing_requests=16)
@serve.ingress(app)
class GR00TPolicyServer:
    def __init__(self, model_path="nvidia/GR00T-N1.7-3B", embodiment_tag="REAL_G1"):
        from gr00t.policy.gr00t_policy import Gr00tPolicy
        from gr00t.data.embodiment_tags import EmbodimentTag
        self.policy = Gr00tPolicy(
            embodiment_tag=EmbodimentTag.resolve(embodiment_tag),
            model_path=model_path,
            device="cuda:0",
        )

    @app.post("/predict")
    async def predict(self, request: Request):
        obs = pickle.loads(await request.body())
        with torch.no_grad():
            action_chunk, _ = self.policy.get_action(obs)
        action = {k: v.detach().cpu().numpy() for k, v in action_chunk.items()}
        return Response(pickle.dumps({"action": action}),
                        media_type="application/octet-stream")
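The listing above leaves out the GET /stats route. As a rough sketch of how the counters behind it could be kept, the helper below accumulates call counts and latency. LatencyStats, record, and snapshot are hypothetical names, not part of GR00T or Ray Serve. Only the field names total_calls and avg_latency_ms come from the /stats output shown later in this post.

```python
import time


class LatencyStats:
    """Hypothetical per-replica counters that a /stats route could return."""

    def __init__(self):
        self.total_calls = 0
        self.total_latency_ms = 0.0

    def record(self, latency_ms: float) -> None:
        """Add one finished inference call to the running totals."""
        self.total_calls += 1
        self.total_latency_ms += latency_ms

    def snapshot(self) -> dict:
        """Return a JSON-serializable summary of the calls seen so far."""
        avg = self.total_latency_ms / self.total_calls if self.total_calls else 0.0
        return {"total_calls": self.total_calls, "avg_latency_ms": round(avg, 2)}


stats = LatencyStats()
start = time.perf_counter()
_ = sum(range(1000))  # stand-in for self.policy.get_action(obs)
stats.record((time.perf_counter() - start) * 1e3)
print(stats.snapshot())
```

In the deployment, one plausible wiring is to create the tracker in __init__, time the get_action call inside predict, and return snapshot() from a route registered with @app.get("/stats"). Each replica would keep its own counters, so with several replicas a single /stats request reflects only the replica that answered it.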

A single serve.run call schedules a replica on a GPU worker, loads the model, and stands up the HTTP server. Two deployment options matter here. First, a 3B model takes longer to load than Ray Serve's default 30-second health-check timeout, so we extend it. Second, the gated vision-language backbone needs a Hugging Face token in the replica's environment, which we pass through runtime_env.

deployment = GR00TPolicyServer.options(
    num_replicas=1,
    health_check_timeout_s=300,        # 3B model load exceeds the 30s default
    ray_actor_options={
        "num_gpus": 1,
        "runtime_env": {"env_vars": {"HF_TOKEN": os.environ["HF_TOKEN"]}},
    },
).bind(model_path="nvidia/GR00T-N1.7-3B", embodiment_tag="REAL_G1")

serve.run(deployment, name="gr00t-policy")

With the endpoint live, one observation in GR00T's REAL_G1 schema produces one action chunk. The observation is a nested dict: a short stack of camera frames, the robot's joint state across arms, hands, and waist, and a natural-language instruction. The policy returns a 40-step action chunk covering every actuated part of the body.

import numpy as np, pickle, requests

identity_pose = np.array([0.3, 0, 0, 1, 0, 0, 0, 1, 0], dtype=np.float32)  # 3 pos + 6 rot
obs = {
    "video": {"ego_view": np.zeros((1, 2, 256, 256, 3), dtype=np.uint8)},  # 2-frame stack
    "state": {
        "left_wrist_eef_9d":  identity_pose[None, None],
        "right_wrist_eef_9d": identity_pose[None, None],
        "left_arm":   np.zeros((1, 1, 7), dtype=np.float32),
        "right_arm":  np.zeros((1, 1, 7), dtype=np.float32),
        "left_hand":  np.zeros((1, 1, 7), dtype=np.float32),
        "right_hand": np.zeros((1, 1, 7), dtype=np.float32),
        "waist":      np.zeros((1, 1, 3), dtype=np.float32),
    },
    "language": {"annotation.human.task_description": [["pick up the apple and place it on the plate"]]},
}

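import time

start = time.perf_counter()
resp = pickle.loads(requests.post(f"{POLICY_URL}/predict", data=pickle.dumps(obs)).content)
print(f"Round trip: {(time.perf_counter() - start) * 1e3:.0f} ms\n")  # wall-clock time of one request
for k, v in resp["action"].items():
    print(f"{k:24s} {np.asarray(v).shape}")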

Round trip: 1641 ms

left_wrist_eef_9d        (1, 40, 9)
right_wrist_eef_9d       (1, 40, 9)
left_hand                (1, 40, 7)
right_hand               (1, 40, 7)
left_arm                 (1, 40, 7)
right_arm                (1, 40, 7)
waist                    (1, 40, 3)
base_height_command      (1, 40, 1)
navigate_command         (1, 40, 3)

A single forward pass takes roughly 1.6 seconds and returns 40 steps of action. That ratio is what makes a shared service practical: a simulator can execute many steps per inference call (more on that in Part 2), so a small fleet of replicas keeps many rollouts fed. Scaling that fleet is one line. Ray Serve schedules each replica on its own GPU and load-balances requests across the replicas automatically.

deployment = GR00TPolicyServer.options(num_replicas=4).bind(...)

This is the same pattern used to serve large language models in production. The policy is now an ordinary HTTP microservice. It’s versioned, health-checked, and horizontally scalable. Nothing about the simulator needs to know how many replicas exist or where they run.

Part 2: The simulator and the policy cannot share a process or a GPU

The problem: The natural first instinct is to import the simulator and the policy into one Python process and call the model in a loop. This can be made to work, but resource contention makes it a poor fit: Isaac Sim's GPU physics and rendering compete with the 3B model for the same VRAM and compute, so when the two are co-located on one GPU, each stalls the other.

The solution: Run each simulator as an isolated process, separate from policy inference. Each one is a Ray task that gets its own clean Python interpreter and event loop and talks to the policy over the HTTP endpoint from Part 1 rather than through an in-process handle. This places the policy and the simulator on different GPUs, so neither competes with the other for memory or compute. The cost is a network hop per inference, which is exactly what action chunking amortizes.

The control loop is small. The simulator queries the policy only when its current action chunk runs out, executing action_horizon steps of physics per call:

def query_policy(policy_url, obs, timeout=120.0):
    r = requests.post(policy_url.rstrip("/") + "/predict",
                      data=pickle.dumps(obs), timeout=timeout)
    r.raise_for_status()
    return pickle.loads(r.content)

action_horizon = 8            # execute 8 of GR00T's 40 steps before re-querying
chunk_idx = action_horizon    # force a query on the very first step

for step in range(max_steps):
    if chunk_idx >= action_horizon:
        action_chunk = query_policy(POLICY_URL, obs)["action"]   # HTTP POST /predict
        chunk_idx = 0
    obs, reward, done, info = env.step(action_chunk, step_idx=chunk_idx)
    chunk_idx += 1
    if done:
        break

With action_horizon = 8, one ~1.6-second inference covers eight physics steps, so the GPU running the policy serves eight times as many simulators as a naive query-every-step loop would.
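The amortization can be checked with a few lines of arithmetic. The sketch below counts how many /predict calls the control loop above makes for a given action_horizon and spreads the roughly 1.6-second inference over the steps it covers. The rollout length of 400 steps is an assumed value for illustration, and the count assumes the episode does not end early.

```python
import math


def policy_queries(max_steps: int, action_horizon: int) -> int:
    """Number of /predict calls when the loop re-queries every action_horizon steps."""
    return math.ceil(max_steps / action_horizon)


def amortized_inference_ms(inference_ms: float, action_horizon: int) -> float:
    """Inference latency spread over the physics steps one chunk covers."""
    return inference_ms / action_horizon


max_steps = 400        # assumed rollout length, not from the post
inference_ms = 1600.0  # roughly the measured forward-pass latency

for horizon in (1, 8, 40):
    calls = policy_queries(max_steps, horizon)
    per_step = amortized_inference_ms(inference_ms, horizon)
    print(f"action_horizon={horizon:2d}  calls={calls:3d}  inference per step={per_step:.0f} ms")
```

A larger action_horizon cuts the number of calls further, at the price of acting on older observations for longer before the policy sees a fresh one.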

The remaining work is translation. GR00T and Isaac Lab speak different schemas, and bridging them is the genuinely fiddly part of wiring any VLA into a simulator. On the way in, Isaac Lab's observation dict becomes GR00T's nested REAL_G1 modality dict, including a two-frame ego_view stack and end-effector poses encoded as 9-D vectors (3 translation values plus the first two rows of the rotation matrix). On the way out, one step of GR00T's nine-key, 40-step action chunk collapses into the flat (1, 28) joint command the Isaac Lab task expects:

def _flatten_action(self, chunk, step_idx=0):
    """One step of GR00T's 9-key chunk -> Isaac Lab's flat (1, 28) joint action."""
    def pick(key, n):
        return np.asarray(chunk[key], dtype=np.float32)[0, step_idx][:n]

    flat = np.concatenate([
        pick("left_arm", 7), pick("right_arm", 7),    # 7 + 7
        pick("left_hand", 7), pick("right_hand", 7),  # 7 + 7  = 28
    ])
    return flat[None, :].astype(np.float32)           # (1, 28)
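The 9-D end-effector encoding on the observation side can be sketched in the same spirit. The helpers below follow the layout described above (3 translation values, then the first two rows of the rotation matrix) and rebuild the full matrix with a Gram-Schmidt step and a cross product. pose_to_9d and rotation_from_9d are hypothetical helper names, not functions from GR00T or Isaac Lab.

```python
import numpy as np


def pose_to_9d(position, rotation) -> np.ndarray:
    """Pack a position and a 3x3 rotation matrix into [x, y, z, row0, row1]."""
    position = np.asarray(position, dtype=np.float32).reshape(3)
    rotation = np.asarray(rotation, dtype=np.float32).reshape(3, 3)
    return np.concatenate([position, rotation[0], rotation[1]])


def rotation_from_9d(pose_9d) -> np.ndarray:
    """Recover the 3x3 rotation matrix from the two stored rows."""
    pose_9d = np.asarray(pose_9d, dtype=np.float32)
    row0 = pose_9d[3:6] / np.linalg.norm(pose_9d[3:6])
    # Remove any component of the second row along the first, then normalize.
    row1 = pose_9d[6:9] - np.dot(row0, pose_9d[6:9]) * row0
    row1 = row1 / np.linalg.norm(row1)
    # For a proper rotation, the third row is the cross product of the first two.
    row2 = np.cross(row0, row1)
    return np.stack([row0, row1, row2])


identity_pose = pose_to_9d([0.3, 0.0, 0.0], np.eye(3))
print(identity_pose)  # same layout as identity_pose in the Part 1 request example
```

Storing only two rows is enough because the third is determined by orthonormality, which is why the round trip through rotation_from_9d loses no information for a valid rotation.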

Once that glue is in place, the policy and simulator communicate across an HTTP boundary while running in separate processes on separate GPUs. The simulator still waits for each new action chunk, but action chunking amortizes that blocking call across multiple physics steps and avoids GPU/runtime contention with the policy server.

Part 3: Policy as a Service

The problem: To say anything about a checkpoint you need many rollouts, across seeds and starting states, and you need them to run in parallel or evaluation becomes the slowest step in the entire model-development loop. Worse, the software stack that makes a single rollout possible is painful to assemble on one machine and far harder to reproduce identically across a cluster of GPU workers. It spans Isaac Lab, GR00T, a gated VLM backbone, pinned versions of transformers and flash-attn, and several runtime patches just to load the model. In most robotics teams this reproducibility problem, not the modeling, is what actually gates evaluation throughput.

The solution – fan out with Ray on Anyscale: Because the policy is a shared HTTP service and each simulator is an independent Ray task, scaling from one rollout to a hundred or more is easy. Each task claims a GPU, runs its rollout, queries the shared policy fleet, and saves its own result.

@ray.remote(num_gpus=1, runtime_env={"env_vars": {"HF_TOKEN": os.environ["HF_TOKEN"]}})
def run_sim_rollout(policy_url, seed=0):
    # Launch the simulator in its own process (see Part 2: Isaac Sim needs a
    # clean interpreter and event loop), pointed at the shared policy endpoint.
    import subprocess
    subprocess.run(
        ["python", "sim_worker.py", "--policy-url", policy_url, "--seed", str(seed)],
        check=True,
    )
    return load_results(seed)   # the subprocess writes its GIF + metrics

results = ray.get([
    run_sim_rollout.remote(POLICY_URL, seed=s) for s in range(100)
])
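Why many rollouts are needed shows up as soon as the results are summarized. The sketch below computes a success rate with a 95% Wilson score interval. It assumes each result is a dict with a boolean "success" field, which is a hypothetical schema, since the post does not show what load_results returns.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize(results):
    """Aggregate rollout results; the 'success' key is an assumed field name."""
    trials = len(results)
    successes = sum(1 for r in results if r["success"])
    low, high = wilson_interval(successes, trials)
    return {"success_rate": successes / trials, "ci_low": low, "ci_high": high}


# Same 50% success rate, increasingly many rollouts: the interval narrows.
for n in (10, 100, 1000):
    fake_results = [{"success": i % 2 == 0} for i in range(n)]
    s = summarize(fake_results)
    print(f"n={n:4d}  rate={s['success_rate']:.2f}  95% CI=({s['ci_low']:.3f}, {s['ci_high']:.3f})")
```

With only a handful of rollouts the interval is too wide to separate two checkpoints of similar quality, which is the practical argument for fanning out across many seeds.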

This is where the disaggregation pays off. Simulators and policy replicas scale independently. If rollouts are starved waiting on inference, Ray Serve can autoscale the policy fleet by adding replicas and routing requests across them. If the policy fleet is underutilized, you can fan out more simulator workers, with Anyscale scaling the underlying GPU cluster to match demand. Neither change touches the other side's code. The same runtime_env mechanism carries the Hugging Face token to the Serve replica and to every simulator task, so credentials propagate across the cluster without manual setup.
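A back-of-envelope estimate helps decide which side to scale. The sketch below assumes each replica handles one request at a time and that a simulator alternates between one blocking inference call and action_horizon physics steps. The helper estimate_policy_replicas, the 0.5-second physics step time, and the target_ongoing_requests value are assumptions for illustration, not measurements from this post. The dictionary uses Ray Serve's autoscaling_config keys.

```python
import math


def estimate_policy_replicas(num_sims: int, inference_s: float,
                             action_horizon: int, sim_step_s: float) -> int:
    """Rough lower bound on replicas needed to keep num_sims simulators fed."""
    cycle_s = inference_s + action_horizon * sim_step_s  # one query plus one chunk of physics
    busy_fraction = inference_s / cycle_s                # share of a cycle spent in inference
    return max(1, math.ceil(num_sims * busy_fraction))


max_replicas = estimate_policy_replicas(
    num_sims=100,       # rollouts running concurrently
    inference_s=1.6,    # roughly the measured forward-pass latency
    action_horizon=8,   # steps executed per policy query
    sim_step_s=0.5,     # assumed wall-clock time per physics step
)

# Would be passed as GR00TPolicyServer.options(autoscaling_config=autoscaling_config),
# in place of a fixed num_replicas.
autoscaling_config = {
    "min_replicas": 1,
    "max_replicas": max_replicas,
    "target_ongoing_requests": 2,  # assumed target; tune against observed queueing
}
print(autoscaling_config)
```

The estimate ignores queueing, so a fleet sized exactly at this number runs near saturation. Treat it as a starting point and adjust using the latency the policy service reports.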

The payoff: evaluation as a scalable service

Putting it together, evaluation stops being a fragile, serial, hardware-bound chore and becomes a service you scale like any other inference workload. The policy fleet reports its own health and throughput through the same HTTP interface. That is useful both as a liveness signal and as a record of how hard the fleet is working across an evaluation sweep:

import requests
print(requests.get(f"{POLICY_URL}/stats").json())

{
  "model_path": "nvidia/GR00T-N1.7-3B",
  "embodiment_tag": "REAL_G1",
  "num_params_B": 3.144016,
  "device": "cuda:0",
  "total_calls": 1,
  "avg_latency_ms": 1619.57,
  "load_time_s": 26.79
}

Scaling up does not change a single line of rollout logic. Run more evaluations by extending the simulator list comprehension. Serve them faster by raising num_replicas. Add GPU workers when you are simulation-bound and policy replicas when you are inference-bound. The same Ray primitives that scale LLM inference to thousands of concurrent requests scale robot-policy evaluation to thousands of concurrent rollouts.

What's Next

If you want to try this yourself:

Run it on Anyscale. Deploy the policy and fan out Isaac Lab rollouts on a managed GPU cluster with the stack pre-built and weights pre-cached, with zero infrastructure setup.
Swap the checkpoint. The deployment takes model_path as an argument, so moving from a base model to a task fine-tune (for example NVIDIA's nvidia/GR00T-N1.6-G1-PnPAppleToPlate, post-trained on this exact pick-and-place task) is a one-line change. The platform stays the same.
Scale the evaluation sweep. Combine num_replicas on the policy with a larger rollout fan-out to evaluate a checkpoint across hundreds of seeds and starting conditions in parallel.
The Ray Serve documentation covers deployments, autoscaling, and multi-replica serving.

Robotics evaluation is evolving into an infrastructure problem. As robotics foundation models grow larger and simulation workloads grow in realism, evaluation begins to resemble modern AI serving: distributed GPU inference fleets, heterogeneous scheduling, autoscaling services, and tightly coordinated closed-loop workloads. The key architectural shift is disaggregation. Policies and simulators scale independently, connected through Ray Serve and orchestrated with Ray Core across the cluster. The same distributed systems patterns that scaled LLM inference are now becoming necessary for robotics simulation and evaluation.