Sparse mixture-of-experts (MoE) models like the DeepSeek and Qwen3 families currently represent the Pareto frontier of model quality and inference efficiency. However, optimal serving patterns, like wide-EP and prefill/decode disaggregation, require engine orchestration logic that is more complex than ordinary horizontal scaling. This is where Ray Serve comes in.
We are excited to announce new Ray Serve LLM APIs that make it easy to deploy state-of-the-art serving patterns with vLLM. In particular, this release brings:
- APIs to support wide expert parallelism
- APIs to support disaggregated prefill/decode
- High-throughput serving validated on the Anyscale Runtime (2.4k tokens/s per H200 on Nebius with InfiniBand)
- Example code for DeepSeek-style disaggregated wide-EP with Ray Serve LLM

Expert parallelism (EP) is a technique, used in conjunction with data parallelism, that distributes an MoE model’s experts across multiple GPUs while replicating the attention layers on each GPU. Wide-EP adds expert load balancing, expert replication, and optimized communication kernels to further improve throughput and help meet latency SLAs.
Deploying models with wide-EP requires coordinating engine replicas to form the expert parallel group and horizontally scaling ingress API servers for high-concurrency serving. The Ray Serve LLM data parallel + expert parallel builder API makes this coordination simple. Here’s an example:
# serve_dp.py
from ray import serve
from ray.serve.llm import LLMConfig, build_dp_deployment
from ray.serve.llm.ingress import OpenAiIngress, make_fastapi_ingress

DP_SIZE = 16

# Build config
config = LLMConfig(
    model_loading_config=dict(
        model_id="deepseek-ai/DeepSeek-R1",
    ),
    engine_kwargs=dict(
        data_parallel_size=DP_SIZE,
        enable_expert_parallel=True,
    ),
    runtime_env={
        "env_vars": {
            "VLLM_USE_DEEP_GEMM": "1",
            "VLLM_ALL2ALL_BACKEND": "deepep_low_latency",
        }
    },
)

# Build the deployment and API server
dp = build_dp_deployment(config)
ingress_cls = make_fastapi_ingress(OpenAiIngress)
app = serve.deployment(ingress_cls).bind(
    llm_deployments=[dp]
)

# Run the Serve graph on the Ray cluster
serve.run(app)
The example can be run via python serve_dp.py. In this example, the build_dp_deployment builder does the heavy lifting. The builder constructs a graph of Serve deployments: one ingress (API server) deployment and one DPServer deployment, with replicas connected via deployment handles. See the Ray Serve docs for more information about key concepts in Ray Serve.
The LLMConfig object specifies the parameters of the deployment, including model, data parallel group size, and environment variables (like DeepEP and DeepGEMM enablement). Additionally, the API server and engine deployments can scale replicas independently (and programmatically) to address load. See the Ray docs on data parallel deployments for more details on the deployment itself.
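For instance, the ingress deployment can be given its own autoscaling policy without touching the engine deployment. Below is a minimal sketch, continuing the serve_dp.py example above and assuming the standard Ray Serve autoscaling_config fields; the replica counts and request target are hypothetical values, and the DP engine group size itself remains fixed by data_parallel_size.

# Sketch: scale the API server independently of the DP engine deployment.
# The autoscaling values below are hypothetical; tune them for your traffic.
ingress_cls = make_fastapi_ingress(OpenAiIngress)
app = serve.deployment(ingress_cls).options(
    autoscaling_config={
        "min_replicas": 2,
        "max_replicas": 8,
        "target_ongoing_requests": 64,
    },
).bind(llm_deployments=[dp])
serve.run(app)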
Below is an architectural diagram of a wide-EP deployment in Ray Serve LLM. Replicas of the DPServer deployment register with the DPRankAssigner actor to receive a DP rank and coordinate a shared IP and port number. Then, these parameters are passed to vLLM engines to bring up the DP group.
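The coordination pattern itself is easy to picture. The snippet below is a simplified illustration only, with hypothetical names and fields; it is not the actual DPRankAssigner implementation in Ray Serve LLM.

# Simplified illustration of the rank-assignment pattern described above
# (hypothetical stand-in, not the actual DPRankAssigner implementation).
import ray


@ray.remote
class RankAssigner:
    def __init__(self, dp_size: int):
        self.dp_size = dp_size
        self.next_rank = 0
        self.master_addr = None  # shared IP/port for the DP group

    def register(self, replica_ip: str, port: int) -> dict:
        # The first replica to register supplies the DP master address.
        if self.master_addr is None:
            self.master_addr = (replica_ip, port)
        rank = self.next_rank
        self.next_rank += 1
        assert rank < self.dp_size, "more replicas registered than DP ranks"
        return {
            "dp_rank": rank,
            "dp_master_ip": self.master_addr[0],
            "dp_master_port": self.master_addr[1],
        }

# Each DPServer replica would call register(...) once, then pass the
# resulting rank and master address to its vLLM engine to form the DP group.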

Prefill/decode disaggregation is a serving pattern that separates the prefill phase (processing prompts) from the decode phase (token generation). This pattern optimizes resource utilization by preventing prefill processing from delaying decode, and allows for better attainment of latency and throughput SLAs.
Disaggregated serving requires heterogeneous vLLM engine deployments in order to independently scale prefill and decode, set engine parameters optimally, and align on port numbers for the KV connector. This is easily done by composing Serve deployments in Ray Serve LLM. Here’s an example:
# serve_pd.py
from ray.serve.llm import LLMConfig, build_pd_openai_app, PrefixCacheAffinityRouter
import ray.serve as serve

# Configure prefill instance
prefill_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct"
    },
    deployment_config={
        "request_router_class": PrefixCacheAffinityRouter
    },
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        }
    }
)

# Configure decode instance
decode_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct"
    },
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        }
    }
)

pd_config = dict(
    prefill_config=prefill_config,
    decode_config=decode_config,
)

app = build_pd_openai_app(pd_config)
serve.run(app)
In this example, the builder build_pd_openai_app creates independent deployments for prefill and decode and automatically sets up the KV transfer connector over NIXL, abstracting away the complexity of prefill/decode disaggregation.
Architecturally, the disaggregated serving Ray Serve application looks like the figure below. The prefill and decode deployments are built separately, with requests orchestrated by PDProxyServer. To maximize performance, the prefill deployment can be deployed with Ray Serve LLM’s PrefixCacheAffinityRouter, which routes requests to optimize prefix cache hit rate. For more information about prefix cache-aware routing, check out the announcement post.
To fulfill a request, the PDProxyServer first routes the input to the prefill deployment with max_tokens=1 to fill the KV cache. The KV cache metadata is then passed to the decode deployment, along with the original request contents, to complete the generation step. See the Ray Serve LLM prefill/decode disaggregation architecture docs for more information.
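The flow can be summarized with a rough sketch. The handle method and field names below (generate, kv_transfer_params) are illustrative assumptions; the real PDProxyServer also handles streaming, error handling, and connector-specific metadata beyond what is shown here.

# Rough sketch of the prefill/decode proxy flow described above
# (hypothetical handle names; not the actual PDProxyServer code).
async def proxy_request(prefill_handle, decode_handle, request: dict) -> dict:
    # 1. Run prefill with max_tokens=1 so the prefill engine only fills
    #    the KV cache and returns the KV-transfer metadata.
    prefill_request = {**request, "max_tokens": 1}
    prefill_result = await prefill_handle.generate.remote(prefill_request)

    # 2. Forward the original request plus the KV-transfer metadata to the
    #    decode engine, which pulls the KV cache over NIXL and generates.
    decode_request = {
        **request,
        "kv_transfer_params": prefill_result.get("kv_transfer_params"),
    }
    return await decode_handle.generate.remote(decode_request)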

As we’ve covered so far, in the sparse MoE era, engine replicas are no longer independent in optimal serving patterns. The orchestration layer must coordinate data parallel attention, expert parallel routing, and disaggregated prefill/decode execution across potentially heterogeneous hardware.
Traditional Kubernetes primitives, designed for stateless microservices, are not optimized to express these tightly coupled, stateful serving patterns that require topology-aware placement and synchronized rank assignment. Ray Serve LLM takes a different approach: making orchestration programmable rather than just configurable.
| | Ray Serve LLM | Kubernetes |
|---|---|---|
| Design Philosophy | Programmable (Python), .yaml for production | Configuration (.yaml) |
| Multi-GPU/Multi-Node Setup | Automatic via Placement Groups | Manual via node affinity |
| Wide-EP | Built-in builder (build_dp_deployment) | StatefulSet |
| Prefill/Decode Disaggregation | Built-in builder (build_pd_openai_app) | Proxy sidecars |
| Prefix Cache-Aware Routing | Built-in | Third-party routing framework |
| Autoscaling | Application logic-based | Resource usage-based |
| Emphasis | Composability | Microservices |
Check out the Anyscale LLM team’s 2025 Ray Summit talk for a more in-depth look.
By unifying application logic and infrastructure in Python, Ray Serve LLM provides composable primitives that directly capture modern serving architectures: data-parallel attention with expert parallelism for efficient KV cache usage, prefill-decode disaggregation for independent scaling and latency optimization, and cache-aware request routing for prefix affinity and reuse.
These capabilities reduce the operational overhead of multi-pod coordination while preserving Kubernetes interoperability. With Ray Serve LLM, developers can define and deploy wide-EP and disaggregated serving topologies as simple, Pythonic builder patterns, achieving vLLM-equivalent engine performance with the added flexibility of dynamic scaling, stateful routing, and fault-tolerant orchestration.
- Data parallel replica groups
- Elastic expert parallelism, presented at Ray Summit 2025
- Improved observability for KV transfer, speculative decoding, and LMCache
- Join the Ray Slack #llm channel: https://www.ray.io/join-slack
- Ray Serve LLM office hours; sign up here for the calendar invite: https://forms.gle/QsjHNGvEhhpRcW3r5
- Try serving LLMs on Anyscale with $100 in free credits!
Many thanks to the Nebius team for supporting this blog post with their AI infrastructure.