
High Performance Distributed Inference with Ray Serve LLM

Today, in partnership with the Google Kubernetes Engine (GKE) team at Google Cloud, we are announcing a major milestone in Ray Serve LLM’s throughput and latency characteristics, driven by architecture changes across the stack. To illustrate the progress Ray Serve LLM has made in reducing orchestration overhead, we include a comparison to vllm-router, a known high-performance, Rust-based routing framework, as well as a retrospective comparison against earlier Ray Serve LLM versions.

Ray is a popular choice for complex distributed computing and batch inference pipelines on heterogeneous hardware. In addition, we believe that Ray’s powerful primitives for fault tolerance, observability, and flexibility across Kubernetes and VMs will enable the next generation of optimizations as LLM inference deployments become increasingly complex.

Below, we cover three major optimizations to the Ray Serve LLM + vLLM stack: direct streaming, a new vLLM Ray executor backend, and HAProxy integration. As a result, we see up to 4.4x higher request throughput than previous versions on prefill-heavy workloads, and up to 24x higher request throughput on decode-heavy workloads.

Ray Serve LLM closes the throughput gap

Cumulative Effect of Optimizations: The figure above shows the cumulative effect of the incremental optimizations compared to vLLM behind vllm-router. Ray Serve LLM now matches vllm-router performance in both prefill- and decode-heavy workloads, representing 4.4x and 24.8x improvements, respectively, over the Ray Serve LLM baseline prior to the optimization effort.[1]

What’s new?

Three major optimizations contribute to Ray Serve LLM’s improved performance.

Ray Serve LLM: Direct Streaming

Ray 2.56 introduces direct streaming mode for Ray Serve LLM. This new architecture decouples the request routing control plane from the request/response streaming data plane.

On the forward path, the HAProxy ingress load balancer queries an ingress request router with the request content for a routing decision, based on a user-configured routing policy. Next, HAProxy establishes a direct HTTP connection with the selected target replica and streams tokens directly back to the client.

The new design resolves a bottleneck in the legacy architecture where the intermediate routing deployment (OpenAiIngress) was also responsible for forwarding response tokens back to HAProxy, taxing its event loop and adding to time per output token (TPOT). Try this out by setting RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1. See docs for usage.
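The effect of this decoupling can be illustrated with a toy model. The sketch below is not Ray Serve code: `Replica`, `Router`, `legacy_stream`, and `direct_stream` are hypothetical names, and the round-robin policy is a stand-in for whatever routing policy is configured. It only counts how many messages the routing component handles per request in each design.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator, List


@dataclass
class Replica:
    """Hypothetical stand-in for a model replica that streams tokens."""
    name: str

    def generate(self, prompt: str, max_tokens: int) -> Iterator[str]:
        for i in range(max_tokens):
            yield f"{self.name}-tok{i}"


class Router:
    """Picks a replica per request (round-robin here, purely illustrative)."""

    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = replicas
        self.messages_handled = 0  # proxy for work on the router's event loop
        self._next = count()

    def route(self, prompt: str) -> Replica:
        self.messages_handled += 1
        return self.replicas[next(self._next) % len(self.replicas)]


def legacy_stream(router: Router, prompt: str, max_tokens: int) -> List[str]:
    """Legacy shape: the routing component also relays every response token."""
    replica = router.route(prompt)
    out = []
    for token in replica.generate(prompt, max_tokens):
        router.messages_handled += 1  # each token crosses the router
        out.append(token)
    return out


def direct_stream(router: Router, prompt: str, max_tokens: int) -> List[str]:
    """Direct shape: one routing decision, then tokens bypass the router."""
    replica = router.route(prompt)
    return list(replica.generate(prompt, max_tokens))


replicas = [Replica("r0"), Replica("r1")]
legacy, direct = Router(replicas), Router(replicas)
legacy_stream(legacy, "hello", max_tokens=500)
direct_stream(direct, "hello", max_tokens=500)
print(legacy.messages_handled, direct.messages_handled)  # 501 1
```

In this model the router's per-request work drops from one message per token to a single routing decision, which is the kind of event-loop relief the direct streaming design targets for decode-heavy traffic.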

Ray Serve Application

Ray Serve LLM Direct Streaming: In the figure above, LLMRouter serves as the direct streaming application’s ingress request router. After LLMRouter serves a routing decision, HAProxy can establish a connection directly to the target replica for data-plane communication. OpenAiIngress was the intermediate routing deployment used in the legacy architecture.

vLLM: Ray Executor Backend V2

The revamped Ray backend for vLLM, RayExecutorV2, is enabled by default in vLLM 0.21.0. It combines Ray’s process management capabilities with the battle-tested feature set of the mp backend’s data and control planes. In addition, the new Ray backend makes it easier to inherit other features, such as asynchronous scheduling.

Ray Serve: HAProxy

In Ray 2.55, we released two major optimizations to Ray Serve: a C-based HAProxy ingress load balancer and high throughput mode optimizations. For LLM serving, this also included disabling Nagle’s algorithm (TCP’s coalescing of small writes into larger segments) by default for improved streaming performance. Details are covered in the announcement blogpost and docs.
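For context, Nagle’s algorithm delays small TCP writes so they can be merged, which works against token-by-token streaming. Applications turn it off per socket with the standard `TCP_NODELAY` option. The snippet below is a generic Python illustration of that socket option, not how Ray Serve or HAProxy configure it internally; `make_low_latency_socket` is a hypothetical helper name.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY=1)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# A nonzero value means small writes are sent immediately instead of coalesced.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
print(nodelay_enabled)
```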

In Ray 2.56, HAProxy is available in all rayproject/ray container images, including rayproject/ray-llm:2.56-py312-cu130, our recommended container image for LLM serving, which includes extras from the vLLM base images, such as DeepGEMM.

If you can’t use the Ray Docker images, in Ray 2.56 you can install HAProxy with pip install ray-haproxy and enable it with RAY_SERVE_EXPERIMENTAL_PIP_HAPROXY=1. In Ray 2.57, the binary will be automatically included and enabled with pip install ray[serve].

Benchmarks

We considered workloads with varying input sequence length (ISL) to output sequence length (OSL) ratios to simulate generic prefill- and decode-heavy workloads, and a multi-turn agentic workload to demonstrate request routing and cache reuse capabilities. In particular, these were:

  • Randomized prefill-heavy workload with ISL=8000, OSL=50

  • Randomized decode-heavy workload with ISL=50, OSL=500

  • Simulated prompt and traffic pattern traces from a multi-turn coding agent capped at 20 turns

The random workloads are intended to isolate orchestration overhead, since randomized prompts get no prefix-caching benefit. Prefill-heavy workloads tend to highlight time to first token (TTFT), while decode-heavy workloads highlight time per output token (TPOT). For these experiments, we sweep concurrency and measure TTFT, TPOT, and throughput for each of the tested frameworks after a set of warm-up requests to eliminate cold-start artifacts.
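As a reference for these metrics, here is a minimal sketch of how TTFT and TPOT can be derived from per-token arrival timestamps for one request. It uses a common convention (TPOT averaged over the tokens after the first); the benchmark suite’s exact definition may differ, and `latency_metrics` is a hypothetical helper.

```python
from typing import List, Tuple


def latency_metrics(request_sent: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (ttft, tpot) in the same time unit as the inputs.

    ttft: delay from sending the request to receiving the first token.
    tpot: mean gap between consecutive tokens after the first one.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Request sent at t=0 ms; tokens arrive at 200, 210, 220, 230 ms.
ttft, tpot = latency_metrics(0.0, [200.0, 210.0, 220.0, 230.0])
print(ttft, tpot)  # 200.0 10.0
```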

For the third case, we generated a synthetic agentic workload using Dynamo’s aiperf benchmark suite. This suite lets us describe scenario parameters such as the number of multi-turn coding sessions, the distribution of wait times for tools and human interactions, and the number of shared or separate context tokens per session. In particular, we emulated a workload with the following characteristics:

  • Fixed number of 20 turns per session

  • Mean initial context = 25,000 tokens and median = 24,000 tokens

  • Mean new tokens = 1,000 and median = 400, modeling short and long tool call responses

  • Mean generation length = 230 and median = 70

  • Median inter-turn latency of 1.2 seconds

  • Effective shared prefix rate of 96% per session
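In each mean/median pair above the mean exceeds the median, which indicates right-skewed distributions. The text does not state which distribution the benchmark uses. Assuming, purely for illustration, a log-normal, the parameters follow from median = exp(mu) and mean = exp(mu + sigma^2 / 2). `lognormal_params` is a hypothetical helper, not part of aiperf.

```python
import math
import random
from typing import Tuple


def lognormal_params(mean: float, median: float) -> Tuple[float, float]:
    """Solve for (mu, sigma) of a log-normal with the given mean and median."""
    if not mean > median > 0:
        raise ValueError("requires mean > median > 0")
    mu = math.log(median)                              # median = exp(mu)
    sigma = math.sqrt(2.0 * math.log(mean / median))   # mean = exp(mu + sigma^2 / 2)
    return mu, sigma


# Workload figures quoted above: (mean, median) in tokens.
targets = {
    "initial_context": (25_000, 24_000),
    "new_tokens": (1_000, 400),
    "generation_length": (230, 70),
}

rng = random.Random(0)  # fixed seed so the example draws are repeatable
for name, (mean, median) in targets.items():
    mu, sigma = lognormal_params(mean, median)
    sample = int(rng.lognormvariate(mu, sigma))
    print(f"{name}: mu={mu:.3f} sigma={sigma:.3f} example_draw={sample}")
```

The much larger mean-to-median ratio for new tokens and generation length yields a larger sigma, matching the description of a mix of short and long tool call responses.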

This workload simulates traffic patterns from a coding agent, with simulated wait times between turns while the agent waits on tool calls. We can use this workload to compare different routing policies as well as frameworks. In particular, we compared:

  • vllm-router’s consistent hashing algorithm

  • Ray Serve LLM with consistent hashing

For agentic workloads, we can include a session ID with requests and use a consistent hashing algorithm to do load-balancing. See the Ray Serve docs on consistent hashing for more.
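A minimal sketch of session-affinity routing with a consistent hash ring follows. This is a generic textbook construction, not the implementation used by Ray Serve LLM or vllm-router; the class name, the virtual-node count, and the replica and session names are assumptions made for illustration.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, replicas: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._points: List[int] = []       # sorted ring positions
        self._owner: Dict[int, str] = {}   # ring position -> replica
        for replica in replicas:
            self.add(replica)

    def add(self, replica: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{replica}#{i}")
            self._owner[point] = replica
            bisect.insort(self._points, point)

    def remove(self, replica: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != replica]
        self._owner = {p: r for p, r in self._owner.items() if r != replica}

    def lookup(self, session_id: str) -> str:
        """Return the owner of the first ring point clockwise from the session's hash."""
        idx = bisect.bisect_right(self._points, _hash(session_id))
        return self._owner[self._points[idx % len(self._points)]]


ring = ConsistentHashRing([f"replica-{i}" for i in range(8)])
sessions = [f"session-{i}" for i in range(1000)]
before = {s: ring.lookup(s) for s in sessions}

# Every turn of a session maps to the same replica, so its KV cache can be reused.
assert all(ring.lookup(s) == before[s] for s in sessions)

# Removing one replica only remaps the sessions that replica owned.
ring.remove("replica-3")
moved = [s for s in sessions if ring.lookup(s) != before[s]]
assert all(before[s] == "replica-3" for s in moved)
print(f"{len(moved)} of {len(sessions)} sessions remapped")
```

The property that matters for multi-turn agents is the last one: when the replica set changes, sessions on unaffected replicas keep their placement and therefore their cached prefixes.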

To isolate framework overhead, we used very small models: Qwen/Qwen3-0.6B for the eight-replica trials and microsoft/Phi-tiny-MoE-instruct for the prefill/decode disaggregation and WideEP trials.

Results

Routing across eight Qwen3-0.6B replicas

Across all three multi-replica workloads, Ray Serve LLM matches vllm-router’s aggregate throughput at every concurrency level tested. Each row in the figure corresponds to a workload: prefill-heavy, decode-heavy, and agentic coding. Each column shows one metric: mean TTFT, mean TPOT, and throughput in requests per second. Every panel compares Ray Serve LLM to vllm-router, with user request concurrency (batch size) on the x-axis.

At concurrency 256 on the random workloads, Ray Serve LLM matches or beats vllm-router on TTFT: 355ms vs. vllm-router’s 389ms on the prefill-heavy workload, and 165ms vs. 190ms on the decode-heavy workload. Throughput tracks closely for all experiments. On the realistic agentic multi-turn workload with KV-aware/session-affinity routing, Ray Serve LLM tracks vllm-router closely on TPOT, and is slightly ahead on TTFT and request throughput.

We investigated the divergence in decode-heavy TTFT between the two frameworks, and found that TTFT matched closely from the engine perspective at concurrency 256 (mean of 14.7ms for Ray Serve LLM vs. 17.7ms for vllm-router). This suggests that Ray Serve LLM’s lower client-perspective TTFT is driven by efficiency in the HAProxy ingress data plane.
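The reasoning can be made explicit with the numbers quoted above. The sketch subtracts engine-side mean TTFT from client-side mean TTFT at concurrency 256 on the decode-heavy workload. Treating the difference as time spent outside the engine is a simplification, and the variable names are illustrative.

```python
# Mean TTFT in milliseconds at concurrency 256, decode-heavy workload (from the text).
client_ttft = {"ray_serve_llm": 165.0, "vllm_router": 190.0}
engine_ttft = {"ray_serve_llm": 14.7, "vllm_router": 17.7}

# Time spent outside the engine (ingress, routing, network, client), per framework.
outside_engine = {
    name: round(client_ttft[name] - engine_ttft[name], 1) for name in client_ttft
}

client_gap = client_ttft["vllm_router"] - client_ttft["ray_serve_llm"]
engine_gap = round(engine_ttft["vllm_router"] - engine_ttft["ray_serve_llm"], 1)

print(outside_engine)          # {'ray_serve_llm': 150.3, 'vllm_router': 172.3}
print(client_gap, engine_gap)  # 25.0 3.0
```

Of the 25ms client-side gap, only about 3ms comes from the engine; the remaining 22ms or so sits outside it, which is consistent with attributing the difference to the ingress path.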

Performance Comparison Across Workloads

WideEP and Prefill/Decode Disaggregation on Phi-tiny-MoE

In the disaggregated 4P4D Wide-EP configuration (one DP4EP4 prefill replica, one DP4EP4 decode replica), Ray Serve LLM beats vllm-router on output throughput across the full concurrency range, using the same agentic workload as the eight-replica trials above. At high concurrency, Ray Serve LLM’s mean TPOT/ITL is slightly better: 13.6ms vs. vllm-router’s 14.8ms at concurrency 256. Additionally, Ray Serve LLM’s prefill/decode disaggregation architecture shows reduced TTFT compared to the baseline: tokenization is done once and reused, reducing frontend overhead for long prompts. For more information on Ray Serve LLM’s prefill/decode disaggregation and Wide-EP APIs, see here.

Agentic 4p4d P:D Wide-EP Comparison

Acknowledgements

This milestone would not have been possible without Anyscale and Ray’s ongoing engineering collaboration with the Google Kubernetes Engine Ray team, who were key in advocating for and validating the HAProxy and Direct Streaming architectures.

You can see more details on the GKE partner blog post: Gemma 4 E2B results on B200.

Conclusion

With optimizations across the stack (HAProxy at the Ray Serve layer, direct streaming in Ray Serve LLM, and the v2 Ray executor backend in vLLM), we have significantly reduced the orchestration overhead that previously separated Ray Serve LLM from standalone vLLM.

Across prefill-heavy, decode-heavy, and agentic multi-turn workloads, Ray Serve LLM now matches vllm-router on aggregate throughput while preserving Ray's fault tolerance, observability, and heterogeneous-hardware primitives. These same primitives extend cleanly to disaggregated prefill/decode and wide-EP topologies, giving developers a single substrate for both the simple single-replica case and the most complex production serving patterns.

Try it out in Ray 2.56, and join us on the Ray Slack to share feedback!

Appendix

Reproduction Notes

Benchmark code here: https://github.com/anyscale/llm-direct-streaming-benchmarks

vLLM version: 0.22.0
Ray version: 2.56 nightly
vllm-router: 0.1.14
AIPerf: 0.8.0
GPUs: 8x NVIDIA H100 80GB HBM3
GPU driver: 580.126.20
CUDA env version: 13.0.0
NCCL env version: 2.27.7
CPU: AMD EPYC 7R13 Processor
CPU topology: 192 logical CPUs, 2 sockets, 48 cores/socket, 2 threads/core, 2 NUMA nodes
Memory: 2.0 TiB


[1] In Ray versions prior to 2.54, we implemented a batching mechanism to mitigate Python event-loop contention in the default streaming path. This batching reduced orchestrator overhead and improved streaming performance by decreasing event-loop pressure. For the comparison shown in this chart, those batching-based mitigations were intentionally disabled. We compare the unbatched baseline of the earlier version against the unbatched configuration with the new optimizations enabled, ensuring an apples-to-apples comparison.
