Ray Summit 2025

Reinforcement Learning

Separations of Concerns in Agentic Reinforcement Learning

Will BrownPrime Intellect

This talk surveys a number of design choices and tradeoffs which arise in building performant distributed training infrastructure for agentic reinforcement learning, and offers an opinionated perspective on how modern research efforts require rethinking standard abstractions and separations of concerns. Using verifiers a case study, we argue that much sharper boundaries should be drawn between trainers and environments than is often observed in popular libraries. In particular, we propose that RL trainers ought to adopt OpenAI Chat Completions as a universal inference spec, and the role of the actor (from the trainer's perspective) should simply be to expose a generic inference endpoint to be used within environments. Further, we argue that orchestration of requests within rollouts (e.g. inference, code execution) and orchestration of requests within training (e.g. weight updates) should be entirely decoupled, and that trainers should not be aware of any environment details modulo high-level spec features (e.g. whether or not multimodality or branching rollouts are required). We view both of these requirements as crucial in order to 1) be able to source large volumes of high-quality train-ready environments, 2) allow environments and trainers to be largely model-agnostic, and 3) ensure that trained models can be directly used in applications of interest without friction.

This talk surveys a number of design choices and tradeoffs which arise in building performant distributed training infrastructure for agentic reinforcement learning, and offers an opinionated perspective on how modern research efforts require rethinking standard abstractions and separations of concerns. Using verifiers a case study, we argue that much sharper boundaries should be drawn between trainers and environments than is often observed in popular libraries. In particular, we propose that RL trainers ought to adopt OpenAI Chat Completions as a universal inference spec, and the role of the actor (from the trainer's perspective) should simply be to expose a generic inference endpoint to be used within environments. Further, we argue that orchestration of requests within rollouts (e.g. inference, code execution) and orchestration of requests within training (e.g. weight updates) should be entirely decoupled, and that trainers should not be aware of any environment details modulo high-level spec features (e.g. whether or not multimodality or branching rollouts are required). We view both of these requirements as crucial in order to 1) be able to source large volumes of high-quality train-ready environments, 2) allow environments and trainers to be largely model-agnostic, and 3) ensure that trained models can be directly used in applications of interest without friction.

Lightning Talk

Finance

Text / Docs

Leveraging Ray, vLLM and LiteLLM built a trusted LLM service for sensitive data

Wenyue LiuSenior ML Platform EngineerCoinbase

Akshit TrehanCoinbase

In H1 2025, Coinbase MLP team built a trusted LLM services by leveraging Ray, vLLM and LiteLLM to make sure Coinbase as the most trusted Crypto exchange.

In this talk, we will go through technical details of user authentication, s2s, litellm distribution, vLLM and Ray to share the whole story of how Coinbase uses Ray and vLLM to build LLM serving API and support internal LLM traffic.

In H1 2025, Coinbase MLP team built a trusted LLM services by leveraging Ray, vLLM and LiteLLM to make sure Coinbase as the most trusted Crypto exchange.

In this talk, we will go through technical details of user authentication, s2s, litellm distribution, vLLM and Ray to share the whole story of how Coinbase uses Ray and vLLM to build LLM serving API and support internal LLM traffic.

Lightning Talk

Text / Docs

Structured Data

Parallelizing Searches over Agentic Pipelines with Ray and syftr

Mark SteadmanData Science Engineering ArchitectDataRobot

Agentic pipelines are central to applying large language models (LLMs) to proprietary or dynamic data. However, constructing efficient agentic flows presents significant challenges. It necessitates precise selection among various components, including vector databases, embedding models, text splitters, retrievers, and synthesizing LLMs. Further complicating this process is the meticulous tuning required for modules such as verifiers, rewriters, and rerankers, each with their own intricate hyperparameter dependencies. In performance-sensitive applications, manually balancing the tradeoffs between latency, accuracy, and cost becomes progressively more difficult.

We introduce syftr, a framework that performs efficient, distributed, multi-objective search over a vast (1023) space of agentic and non-agentic configurations . Using advances in Bayesian Optimization, syftr discovers Pareto-optimal flows that jointly optimize task accuracy and cost. A novel early-stopping mechanism further improves efficiency by pruning clearly suboptimal candidates. Across multiple benchmarks, syftr finds flows which are on average ≈ 9× cheaper while preserving most of the accuracy of the most accurate flows on the Pareto-frontier. Furthermore, syftr’s ability to design and optimize also allows easy integration of new modules, making it even easier and faster to realize high-performing generative AI pipelines.

Building syftr is especially challenging from an infrastructure point-of-view. For example, one of the most compute intensive parts of the search process is Vector Database (VDBs) construction. As syftr is trying out multiple embedding models, chunk sizes, etc, VDB construction for large datasets forms a large part of the search compute. Small embedding models can be run on CPU (cheap and plenty) while larger ones require GPUs (expensive and scarce). syftr utilizes Ray to distribute this workload in heterogenous compute clusters of CPUs and different GPU SKUs (T4, A100, H100 etc). When we self-host OSS models in the search space, syftr creates inference load hotspots as the optimizer hones in on a few LLMs and embedding models which are part of flows on the Pareto-frontier. Ray Serve provides a way to autoscale high demand models while scaling to zero cold models.

In this talk we go deep into how Ray’s unique abilities of scale, robustness and ease-of-use accelerates research like syftr at the intersection of AI and AI infrastructure.

Paper: https://arxiv.org/abs/2505.20266 (AutoML 2025)

Code: https://github.com/datarobot/syftr

Blog: https://www.datarobot.com/blog/pareto-optimized-ai-workflows-syftr

Agentic pipelines are central to applying large language models (LLMs) to proprietary or dynamic data. However, constructing efficient agentic flows presents significant challenges. It necessitates precise selection among various components, including vector databases, embedding models, text splitters, retrievers, and synthesizing LLMs. Further complicating this process is the meticulous tuning required for modules such as verifiers, rewriters, and rerankers, each with their own intricate hyperparameter dependencies. In performance-sensitive applications, manually balancing the tradeoffs between latency, accuracy, and cost becomes progressively more difficult.

We introduce syftr, a framework that performs efficient, distributed, multi-objective search over a vast (1023) space of agentic and non-agentic configurations . Using advances in Bayesian Optimization, syftr discovers Pareto-optimal flows that jointly optimize task accuracy and cost. A novel early-stopping mechanism further improves efficiency by pruning clearly suboptimal candidates. Across multiple benchmarks, syftr finds flows which are on average ≈ 9× cheaper while preserving most of the accuracy of the most accurate flows on the Pareto-frontier. Furthermore, syftr’s ability to design and optimize also allows easy integration of new modules, making it even easier and faster to realize high-performing generative AI pipelines.

Building syftr is especially challenging from an infrastructure point-of-view. For example, one of the most compute intensive parts of the search process is Vector Database (VDBs) construction. As syftr is trying out multiple embedding models, chunk sizes, etc, VDB construction for large datasets forms a large part of the search compute. Small embedding models can be run on CPU (cheap and plenty) while larger ones require GPUs (expensive and scarce). syftr utilizes Ray to distribute this workload in heterogenous compute clusters of CPUs and different GPU SKUs (T4, A100, H100 etc). When we self-host OSS models in the search space, syftr creates inference load hotspots as the optimizer hones in on a few LLMs and embedding models which are part of flows on the Pareto-frontier. Ray Serve provides a way to autoscale high demand models while scaling to zero cold models.

In this talk we go deep into how Ray’s unique abilities of scale, robustness and ease-of-use accelerates research like syftr at the intersection of AI and AI infrastructure.

Paper: https://arxiv.org/abs/2505.20266 (AutoML 2025)

Code: https://github.com/datarobot/syftr

Blog: https://www.datarobot.com/blog/pareto-optimized-ai-workflows-syftr

Reinforcement Learning

Physical AI

Image

Structured Data

Video

Ray at Applied Intuition: RL and Inference

Yi Sheng OngSoftware EngineerApplied Intuition

Eric HigginsSoftware EngineerApplied Intuition

Applied Intuition uses Ray to scale reinforcement learning and inference workloads that operate on petabytes of raw sensor data for autonomous driving.

For reinforcement learning, Ray’s distributed execution model and RLlib enable scalable open- and closed-loop training. We leverage Ray to run thousands of parallel environment rollouts, colocate GPU learners and simulators efficiently, and recover full state during training. Our hybrid setup builds on the new RLlib API to support both imitation and closed-loop regimes — allowing driving models to generalize to out-of-distribution scenarios.

For inference, Ray Data powers large-scale batch processing of sensor samples, providing a unified, high-throughput interface for streaming data from our lake and performing CPU-intensive transformations before GPU inference. The system scales seamlessly across Kubernetes, making it easy to transition from development to production.

Applied Intuition uses Ray to scale reinforcement learning and inference workloads that operate on petabytes of raw sensor data for autonomous driving.

For reinforcement learning, Ray’s distributed execution model and RLlib enable scalable open- and closed-loop training. We leverage Ray to run thousands of parallel environment rollouts, colocate GPU learners and simulators efficiently, and recover full state during training. Our hybrid setup builds on the new RLlib API to support both imitation and closed-loop regimes — allowing driving models to generalize to out-of-distribution scenarios.

For inference, Ray Data powers large-scale batch processing of sensor samples, providing a unified, high-throughput interface for streaming data from our lake and performing CPU-intensive transformations before GPU inference. The system scales seamlessly across Kubernetes, making it easy to transition from development to production.

Reinforcement Learning

Physical AI

Research

Structured Data

End‑to‑End Hybrid Reinforcement and Imitation Learning for Robotics with Ray

Ahmet Salih GundogduMachine Learning EngineerRAI Institute (aka Boston Dynamics AI Institute)

Valerio PepeMachine Learning Engineer InternRAI Institute (aka Boston Dynamics AI Institute)

The nature of machine learning in robotics demands complex abstractions of hardware and training/simulation layers to use a combination of RL and IL (Imitation Learning). In this respect, policy learning for robotics rarely fits on one kind of machine. For instance, massive simulation parallelization with GPU physics and rendering in Isaac Lab demand RTX‑class GPUs, while policy training benefits from large VRAM and FLOPs. Over the past year we have built our infra on Ray that hides this hardware/software diversity and lets researchers focus on science, not sys‑admin.

Our platform offers:

- Unified orchestration – a single Ray workflow allows to train full state RL models that are used to train multi-task IL policy, and evaluation in simulation.

- Heterogeneous GPU scheduling – placement groups assign Isaac Lab simulators to RTX workers and gradient computation to A100/H100 trainers without manual mapping.

- Isolated deployment targets – the same job definition that trains a policy can package it into a lightweight Ray Serve micro‑service that runs next to the robot or on a nearby edge server, shielding control code from research churn.

During the live demo we will:

- Launch a hybrid RL‑IL run that automatically provisions both Nvidia-RTX GPUs and A100/H100 nodes.

- Watch Ray adapt the cluster as workloads shift from simulation to learning to evaluation.

- Deploy the resulting policy to an isolated runtime on the robot—ready for immediate testing.

Attendees will leave with practical design patterns for juggling simulator‑heavy and large‑scale network training inside one reproducible Ray ecosystem, plus insights on meeting real‑time robotics constraints while remaining GPU‑efficient.

The nature of machine learning in robotics demands complex abstractions of hardware and training/simulation layers to use a combination of RL and IL (Imitation Learning). In this respect, policy learning for robotics rarely fits on one kind of machine. For instance, massive simulation parallelization with GPU physics and rendering in Isaac Lab demand RTX‑class GPUs, while policy training benefits from large VRAM and FLOPs. Over the past year we have built our infra on Ray that hides this hardware/software diversity and lets researchers focus on science, not sys‑admin.

Our platform offers:

- Unified orchestration – a single Ray workflow allows to train full state RL models that are used to train multi-task IL policy, and evaluation in simulation.

- Heterogeneous GPU scheduling – placement groups assign Isaac Lab simulators to RTX workers and gradient computation to A100/H100 trainers without manual mapping.

- Isolated deployment targets – the same job definition that trains a policy can package it into a lightweight Ray Serve micro‑service that runs next to the robot or on a nearby edge server, shielding control code from research churn.

During the live demo we will:

- Launch a hybrid RL‑IL run that automatically provisions both Nvidia-RTX GPUs and A100/H100 nodes.

- Watch Ray adapt the cluster as workloads shift from simulation to learning to evaluation.

- Deploy the resulting policy to an isolated runtime on the robot—ready for immediate testing.

Attendees will leave with practical design patterns for juggling simulator‑heavy and large‑scale network training inside one reproducible Ray ecosystem, plus insights on meeting real‑time robotics constraints while remaining GPU‑efficient.

Machine Learning

Media & Gaming

Structured Data

A foundation model for each enterprise

Dhruv NigamLead MLEDream11

B2C enterprises possess vast and diverse datasets and often gain value from predicting user behavior. Yet their predictive modeling capabilities are fragmented into siloed, single-task systems. This approach creates redundant feature engineering, incurs excessive training costs, and lacks the flexibility to adapt to new predictive tasks.

We argue, based on our experience at Dream11, that the paradigm of large-scale, pre-trained foundation models can be extended beyond the domain of language to create a single, cohesive user-intelligence engine for any enterprise. This talk introduces Lumos, a framework for building enterprise-specific foundation models, and details how Ray played a crucial role in overcoming the significant scaling challenges involved.

The core of Lumos is a multi-task, multi-timestep Transformer architecture designed to forecast a wide array of user behaviors (e.g., churn, engagement, lifetime value) by ingesting historical user transactions, user attributes, and the calendar of future business events (supply). Our architecture introduces a strong inductive bias through a cross-attention mechanism, where a decoder conditioned on future supply events attends to an encoder that has processed the full history of user behavior. We will also share our work on formulating the equivalents of scaling laws and emergent abilities in the realm of enterprise transaction data. While Ray datasets helped us efficiently load data from blob storage into GPUs, Ray train helped us distribute training across multi-node, multi-GPU clusters without needing any code changes.

Using these modules, we were able to efficiently train models using over 50 terabytes of user data across multiple nodes and GPUs, achieving high accuracy and notable improvements in online metrics.

B2C enterprises possess vast and diverse datasets and often gain value from predicting user behavior. Yet their predictive modeling capabilities are fragmented into siloed, single-task systems. This approach creates redundant feature engineering, incurs excessive training costs, and lacks the flexibility to adapt to new predictive tasks.

We argue, based on our experience at Dream11, that the paradigm of large-scale, pre-trained foundation models can be extended beyond the domain of language to create a single, cohesive user-intelligence engine for any enterprise. This talk introduces Lumos, a framework for building enterprise-specific foundation models, and details how Ray played a crucial role in overcoming the significant scaling challenges involved.

The core of Lumos is a multi-task, multi-timestep Transformer architecture designed to forecast a wide array of user behaviors (e.g., churn, engagement, lifetime value) by ingesting historical user transactions, user attributes, and the calendar of future business events (supply). Our architecture introduces a strong inductive bias through a cross-attention mechanism, where a decoder conditioned on future supply events attends to an encoder that has processed the full history of user behavior. We will also share our work on formulating the equivalents of scaling laws and emergent abilities in the realm of enterprise transaction data. While Ray datasets helped us efficiently load data from blob storage into GPUs, Ray train helped us distribute training across multi-node, multi-GPU clusters without needing any code changes.

Using these modules, we were able to efficiently train models using over 50 terabytes of user data across multiple nodes and GPUs, achieving high accuracy and notable improvements in online metrics.

Ray Deep Dives

Optimizing GPU Utilization: Ray Train’s Distributed Solutions for Removing Training Bottlenecks

Justin Yu

TBD

TBD

Ray Deep Dives

Ray Serve: Advancing scalability and flexibility

Cindy Zhang

Abrar ShiekhSoftware EngineerAnyscale

Ray Serve has become one of the most popular libraries for modern AI applications. Unlike most online inference frameworks, Ray Serve is natively built to support multiple models, any hardware, and inference framework. This session will highlight significant advancements in Ray Serve last year including - 1. flexibility to capture more complex inference patterns, 2. performance at scale, 3. multi-cloud inference.

Ray Serve has become one of the most popular libraries for modern AI applications. Unlike most online inference frameworks, Ray Serve is natively built to support multiple models, any hardware, and inference framework. This session will highlight significant advancements in Ray Serve last year including - 1. flexibility to capture more complex inference patterns, 2. performance at scale, 3. multi-cloud inference.

Lunch + Networking

Grab lunch and explore Rayground, where you'll find Anyscale demos, sponsor booths, and the Lightning Theater.

Grab lunch and explore Rayground, where you'll find Anyscale demos, sponsor booths, and the Lightning Theater.

vLLM

Text / Docs

State of vLLM 2025

Simon MovLLM LeadvLLM - UC Berkeley

In this talk, we will cover the latest one year in review for the vLLM project and discuss the road ahead.

In this talk, we will cover the latest one year in review for the vLLM project and discuss the road ahead.

Lightning Talk

Text / Docs

Horizontal, Predictable, High-Throughput Inference for Synthetic Data Generation, Evals, and More

Seth KimmelCEOSutro

Sutro (https://sutro.sh/) is an accelerated batch inference service. We use vLLM under the hood to power offline inference workloads ranging from a few hundred to tens of billions of tokens, often for synthetic data generation, evals, or processing unstructured data. It's critical for us to be able to use vLLM in a predictable way - from a cost, performance, and transparency standpoint. In this talk we'll explain how we use vLLM under the hood, from our custom implementation, to our performance profiler, throughput estimation algorithms, and cost attribution instrumentation. This talk is geared towards teams looking to push the boundaries of what's possible with vLLM at scale.

Sutro (https://sutro.sh/) is an accelerated batch inference service. We use vLLM under the hood to power offline inference workloads ranging from a few hundred to tens of billions of tokens, often for synthetic data generation, evals, or processing unstructured data. It's critical for us to be able to use vLLM in a predictable way - from a cost, performance, and transparency standpoint. In this talk we'll explain how we use vLLM under the hood, from our custom implementation, to our performance profiler, throughput estimation algorithms, and cost attribution instrumentation. This talk is geared towards teams looking to push the boundaries of what's possible with vLLM at scale.

Lightning Talk

Structured Data

Improved Scheduling Flexibility with Label Selectors in Ray

Ryan O'LearySoftware EngineerGoogle

Mengjin YanSoftware EngineerAnyscale

Acquiring scarce accelerator resources for Ray applications on a heterogeneous cluster can be challenging due to different accelerator type and topology requirements and limited availability. These issues previously required workarounds such as setting custom resources and accelerator_type.

Ray's new Label Selector API helps alleviate these challenges by enabling users to schedule tasks, actors, and placement groups using Ray node labels specified at RayCluster creation by the user and detected automatically by Ray. This API offers support for both static and auto-scaling RayClusters, fallback strategies, and per-bundle selectors, enabling users to make precise placement decisions at the application level. This functionality is incorporated in the Anyscale platform, Ray dashboard, and KubeRay. The same user code operates identically across platforms.

This talk will primarily explore common use cases, API modifications, and a live demo highlighting how the new label selector API enhances scheduling flexibility.

Acquiring scarce accelerator resources for Ray applications on a heterogeneous cluster can be challenging due to different accelerator type and topology requirements and limited availability. These issues previously required workarounds such as setting custom resources and accelerator_type.

Ray's new Label Selector API helps alleviate these challenges by enabling users to schedule tasks, actors, and placement groups using Ray node labels specified at RayCluster creation by the user and detected automatically by Ray. This API offers support for both static and auto-scaling RayClusters, fallback strategies, and per-bundle selectors, enabling users to make precise placement decisions at the application level. This functionality is incorporated in the Anyscale platform, Ray dashboard, and KubeRay. The same user code operates identically across platforms.

This talk will primarily explore common use cases, API modifications, and a live demo highlighting how the new label selector API enhances scheduling flexibility.

Machine Learning

Structured Data

Exabyte-scale Streaming Iceberg IO with Ray, Flink, and DeltaCAT

Patrick AmesPrincipal EngineerAmazon

Production case study highlighting how Amazon uses Ray and DeltaCAT at exabyte-scale to resolve longstanding performance & scale challenges integrating streaming pipelines with Apache Iceberg. Highlights how the Apache Flink, Ray, Apache Beam, and Apache Spark communities can start bringing the same benefits to their workloads using DeltaCAT's Iceberg Table management jobs on Ray together with Flink and Beam.

Production case study highlighting how Amazon uses Ray and DeltaCAT at exabyte-scale to resolve longstanding performance & scale challenges integrating streaming pipelines with Apache Iceberg. Highlights how the Apache Flink, Ray, Apache Beam, and Apache Spark communities can start bringing the same benefits to their workloads using DeltaCAT's Iceberg Table management jobs on Ray together with Flink and Beam.

Machine Learning

Image

Structured Data

Video

Scalable High-Performance Multi-Modal Data Curation on Ray

Arham MehtaProduct Manager, Deep Learning SoftwareNVIDIA

Processing petabyte-scale, multi-modal data for Generative AI - spanning text, video, audio, and more—is a complex distributed systems challenge. These pipelines require a framework capable of handling heterogeneous workloads, stateful operations like deduplication, and GPU acceleration. This session explores architectural patterns for building such pipelines using Ray.

Drawing on our experience building NVIDIA NeMo Curator - we demonstrate how Ray’s primitives enable efficient, scalable data processing. We will cover how to leverage Ray Actors for stateful, long-running tasks and Ray Tasks for stateless parallel transformations, managing heterogeneous CPU/GPU resources to maximize throughput and pipeline robustness.

Processing petabyte-scale, multi-modal data for Generative AI - spanning text, video, audio, and more—is a complex distributed systems challenge. These pipelines require a framework capable of handling heterogeneous workloads, stateful operations like deduplication, and GPU acceleration. This session explores architectural patterns for building such pipelines using Ray.

Drawing on our experience building NVIDIA NeMo Curator - we demonstrate how Ray’s primitives enable efficient, scalable data processing. We will cover how to leverage Ray Actors for stateful, long-running tasks and Ray Tasks for stateless parallel transformations, managing heterogeneous CPU/GPU resources to maximize throughput and pipeline robustness.

LLMs

Media & Gaming

Text / Docs

Scaling LLM Post-Training at Character.AI

Haoran LiCharacter AI

Character AI is the world's leading application for AI entertainment, serving tens of millions of users per day with large language models (LLMs). To continuously improve the models that power our AI Characters, we have built a robust and scalable post-training stack entirely on open-source technologies in the Ray ecosystem. Our fine-tuning stack, internally named Rayman, has allowed us to accelerate our model development velocity and large MoE model training efficiency. We are also able to utilize and adapt open-source RL libraries(Verl) to deal with our unique challenges in RL training. In this talk, we will detail the architecture of Rayman, the open-source projects we leverage, our RL framework and the ML challenges we've overcome.

Specifically, we will cover:

1. Infrastructure for Fine-tuning and Distillation: We will introduce Rayman, our internal framework built on Ray Data, Ray Train, and Deepspeed/PyTorch FSDP/Internal Pipeline SFT for orchestrating all distributed workloads. We'll detail how we use this for large-scale SFT, DPO including our strategy for training massive Mixture-of-Experts (MoE) models like those from DeepSeek. We will also go over our approach for knowledge distillation of state of the art open source LLMs into smaller, more tractable models.

2. Reinforcement Learning from Real User Feedback: A core challenge in aligning models for open-ended creative dialogue is that there are no verifiable rewards. We will discuss how we tackle this problem by training our own reward models on real user interaction data which we then use for RL. We'll detail our RL framework built on top of Verl, which allows us to translate noisy, real-world user feedback into a clear signal that can be effectively "hill-climbed" using a variety of reinforcement learning techniques to significantly improve the quality of our models.

Character AI is the world's leading application for AI entertainment, serving tens of millions of users per day with large language models (LLMs). To continuously improve the models that power our AI Characters, we have built a robust and scalable post-training stack entirely on open-source technologies in the Ray ecosystem. Our fine-tuning stack, internally named Rayman, has allowed us to accelerate our model development velocity and large MoE model training efficiency. We are also able to utilize and adapt open-source RL libraries(Verl) to deal with our unique challenges in RL training. In this talk, we will detail the architecture of Rayman, the open-source projects we leverage, our RL framework and the ML challenges we've overcome.

Specifically, we will cover:

1. Infrastructure for Fine-tuning and Distillation: We will introduce Rayman, our internal framework built on Ray Data, Ray Train, and Deepspeed/PyTorch FSDP/Internal Pipeline SFT for orchestrating all distributed workloads. We'll detail how we use this for large-scale SFT, DPO including our strategy for training massive Mixture-of-Experts (MoE) models like those from DeepSeek. We will also go over our approach for knowledge distillation of state of the art open source LLMs into smaller, more tractable models.

2. Reinforcement Learning from Real User Feedback: A core challenge in aligning models for open-ended creative dialogue is that there are no verifiable rewards. We will discuss how we tackle this problem by training our own reward models on real user interaction data which we then use for RL. We'll detail our RL framework built on top of Verl, which allows us to translate noisy, real-world user feedback into a clear signal that can be effectively "hill-climbed" using a variety of reinforcement learning techniques to significantly improve the quality of our models.

Breakfast + Networking

Enjoy a light breakfast and coffee while mingling with Ray attendees

Enjoy a light breakfast and coffee while mingling with Ray attendees

Lightning Talk

Text / Docs

Structured Data

From S3 Bottlenecks to Scalable I/O: Evolving Ray Pipelines with Alluxio

Bin FanVP of TechnologyAlluxio

Fast model iteration is essential for AI startups, especially those building advanced document understanding and search capabilities. In this talk, we walk through the evolution of a Ray-based training pipeline from a real-world use case, designed to accelerate experimentation and reduce deployment latency for new search models. We compare three architectural designs: direct S3 access from Ray, Alluxio as an async write buffer, and Alluxio with fully decentralized cache-through writes.

In the initial setup, direct I/O between Ray for preprocessing and PyTorch for training struggled under high concurrency—over 1,000 Ray workers writing during preprocessing—leading to poor throughput and high S3 egress costs. A second iteration introduced Alluxio with async writes, which improved training-stage read performance, but metadata saturation and write path instability led to frequent cluster crashes and job restarts.

Migrating to a decentralized Alluxio 3.x setup with cache-through writes enabled stable ingestion under bursty 400 Gbps internal bandwidth, while throttling 10 Gbps outbound writes to S3. The result: no restarts, better GPU utilization, and fast, scalable iteration. We’ll share practical lessons on turning a fragile I/O stack into a production-grade AI data infrastructure.

Fast model iteration is essential for AI startups, especially those building advanced document understanding and search capabilities. In this talk, we walk through the evolution of a Ray-based training pipeline from a real-world use case, designed to accelerate experimentation and reduce deployment latency for new search models. We compare three architectural designs: direct S3 access from Ray, Alluxio as an async write buffer, and Alluxio with fully decentralized cache-through writes.

In the initial setup, direct I/O between Ray for preprocessing and PyTorch for training struggled under high concurrency—over 1,000 Ray workers writing during preprocessing—leading to poor throughput and high S3 egress costs. A second iteration introduced Alluxio with async writes, which improved training-stage read performance, but metadata saturation and write path instability led to frequent cluster crashes and job restarts.

Migrating to a decentralized Alluxio 3.x setup with cache-through writes enabled stable ingestion under bursty 400 Gbps internal bandwidth, while throttling 10 Gbps outbound writes to S3. The result: no restarts, better GPU utilization, and fast, scalable iteration. We’ll share practical lessons on turning a fragile I/O stack into a production-grade AI data infrastructure.

Ray Summit Celebration Happy Hour

Join colleagues and friends in the Rayground for demos, sponsors and Ray conversations.

Join colleagues and friends in the Rayground for demos, sponsors and Ray conversations.

Matrix: reliable framework for data-centric experimentation at scale

Dong WangResearch EngineerMeta

Shang-Wen LiResearch scientistMeta

Ramya RaghavendraTechnical Program DirectorMeta

Scaled and high quality data is the oil driving progress of AGI in research and development. Thanks to the foundational works such as Ray, Slurm, and vLLM, it becomes a lot easier to manage compute resources at scale and access to a diverse set of SOTA LLMs. However, these efforts are often designed for experienced engineers with entry barriers for researchers to unleash their full potential. Thus, in Fundamental AI Research Lab (FAIR) at Meta, we have built Matrix, a reliable framework for data-centric experimentation at scale, to connect these foundational pieces for researchers to quickly iterate on their ideas and build experiments with large-scale models and data.

Matrix supports robust and auto-scaled data generation from LLMs, game engine, and physics or world model simulators, with one command. It also offers easy setup for scalable data processing and augmentation such as LLM-as-a-judge in batch, safe code execution for verification, and data dedup, classification, or clustering. The framework also offers efficient and reproducible evaluation pipelines for large teams to collaborate on.

Matrix is widely used to empower Meta’s research and production bets in AGI including MLLMs and world modeling. In this session, we will introduce the Matrix framework from its design and synergy with other industry initiatives like Ray and vLLM, to the research and production use cases Matrix enables. We will also provide a short tutorial for developers to join the area.

Scaled and high quality data is the oil driving progress of AGI in research and development. Thanks to the foundational works such as Ray, Slurm, and vLLM, it becomes a lot easier to manage compute resources at scale and access to a diverse set of SOTA LLMs. However, these efforts are often designed for experienced engineers with entry barriers for researchers to unleash their full potential. Thus, in Fundamental AI Research Lab (FAIR) at Meta, we have built Matrix, a reliable framework for data-centric experimentation at scale, to connect these foundational pieces for researchers to quickly iterate on their ideas and build experiments with large-scale models and data.

Matrix supports robust and auto-scaled data generation from LLMs, game engine, and physics or world model simulators, with one command. It also offers easy setup for scalable data processing and augmentation such as LLM-as-a-judge in batch, safe code execution for verification, and data dedup, classification, or clustering. The framework also offers efficient and reproducible evaluation pipelines for large teams to collaborate on.

Matrix is widely used to empower Meta’s research and production bets in AGI including MLLMs and world modeling. In this session, we will introduce the Matrix framework from its design and synergy with other industry initiatives like Ray and vLLM, to the research and production use cases Matrix enables. We will also provide a short tutorial for developers to join the area.

Machine Learning

Structured Data

Breaking the Dataset Iteration Bottleneck: Real-Time ML Experimentation with Ray

Sameer JainMachine Learning EngineerPinterest

Kritarth AnandStaff Machine Learning EngineerPinterest

Filip RyznerMachine Learning EngineerPinterest

At Pinterest, iterating on dataset curation and label generation consistently improves our recommendation models, but this process is severely constrained by expensive and time-consuming data generation workflows. When experimenting with new sampling strategies, features, or labels, teams face a critical choice: either backfill long-running jobs that strain compute resources and budget, or wait weeks for experimental datasets to naturally populate with new data. This creates a fundamental barrier to data-driven model improvement, where a single dataset iteration either costs thousands of dollars and requires a tedious monitoring of the backfill process, or takes weeks of waiting. In either case, the developer velocity is severely impacted.

Two pivotal use-cases within Pinterest exemplify these challenges, namely the Dataset Sampling Strategy Exploration and the Generation of Labels for Downstream Engagement Modeling. Sampling is fundamental for creating training datasets from massive data repositories. Our sampling strategy determines the composition and quality of resulting datasets, thus the resulting model, yet iterating on these strategies is prohibitively difficult and expensive. The current data generation workflows also prevent adoption of sophisticated techniques like score-based negative sampling that requires real-time computation during training. Downstream engagement labels present a similarly complex challenge. Unlike immediate action labels, these labels focus on driving long-term user engagement rather than instant responses. The complexity increases because each label involves multiple tunable hyperparameters (e.g. engagement decay) creating a vast search space. In both cases, teams would ideally conduct hyperparameter tuning to systematically explore these vast search spaces and identify optimal configurations, but the current data pipeline architecture makes such comprehensive exploration prohibitively expensive and time-consuming.

To address these limitations, we shifted both use-cases from static dataset generation to a streaming paradigm using Ray that enables truly iterative experimentation, moving sampling and label generation logic directly into the training dataloader to process data in real-time. This eliminates the costly choice between expensive backfills and weeks of waiting, while enabling comprehensive hyperparameter exploration. The impact spans both domains: sampling changes on ranking models now ship with 10x faster development time, while downstream engagement label experimentation has been reduced from 6 weeks to 3 days and adopted by multiple teams. The solution's power is fully realized during productization, where teams must simultaneously optimize both label generation parameters and sampling strategies - our unified approach handles both seamlessly within the same pipeline. Combined with Ray's bucket join capabilities that enable joining large embedding features and multiday datasets previously impossible due to cost and compute constraints, this has saved over hundreds of thousands dollars in costs while transforming dataset iteration from a fundamental bottleneck into an enabler of rapid experimentation.

At Pinterest, iterating on dataset curation and label generation consistently improves our recommendation models, but this process is severely constrained by expensive and time-consuming data generation workflows. When experimenting with new sampling strategies, features, or labels, teams face a critical choice: either backfill long-running jobs that strain compute resources and budget, or wait weeks for experimental datasets to naturally populate with new data. This creates a fundamental barrier to data-driven model improvement, where a single dataset iteration either costs thousands of dollars and requires a tedious monitoring of the backfill process, or takes weeks of waiting. In either case, the developer velocity is severely impacted.

Two pivotal use-cases within Pinterest exemplify these challenges, namely the Dataset Sampling Strategy Exploration and the Generation of Labels for Downstream Engagement Modeling. Sampling is fundamental for creating training datasets from massive data repositories. Our sampling strategy determines the composition and quality of resulting datasets, thus the resulting model, yet iterating on these strategies is prohibitively difficult and expensive. The current data generation workflows also prevent adoption of sophisticated techniques like score-based negative sampling that requires real-time computation during training. Downstream engagement labels present a similarly complex challenge. Unlike immediate action labels, these labels focus on driving long-term user engagement rather than instant responses. The complexity increases because each label involves multiple tunable hyperparameters (e.g. engagement decay) creating a vast search space. In both cases, teams would ideally conduct hyperparameter tuning to systematically explore these vast search spaces and identify optimal configurations, but the current data pipeline architecture makes such comprehensive exploration prohibitively expensive and time-consuming.

To address these limitations, we shifted both use-cases from static dataset generation to a streaming paradigm using Ray that enables truly iterative experimentation, moving sampling and label generation logic directly into the training dataloader to process data in real-time. This eliminates the costly choice between expensive backfills and weeks of waiting, while enabling comprehensive hyperparameter exploration. The impact spans both domains: sampling changes on ranking models now ship with 10x faster development time, while downstream engagement label experimentation has been reduced from 6 weeks to 3 days and adopted by multiple teams. The solution's power is fully realized during productization, where teams must simultaneously optimize both label generation parameters and sampling strategies - our unified approach handles both seamlessly within the same pipeline. Combined with Ray's bucket join capabilities that enable joining large embedding features and multiday datasets previously impossible due to cost and compute constraints, this has saved over hundreds of thousands dollars in costs while transforming dataset iteration from a fundamental bottleneck into an enabler of rapid experimentation.

Machine Learning

Text / Docs

Image

Structured Data

Super App, Super Powers: How a Small Team Built Grab's Multi-Modal Foundation Model with Ray

Nick BuehrerLead Machine Learning EngineerGrab

At Grab, Southeast Asia's leading super app, a single user journey is a rich, multi-modal story. To understand our users holistically, we needed to learn from the complex web of their interactions across our diverse services. Our goal was to build a powerful user embedding foundation model that could capture this complete view, enhancing numerous downstream models and personalizing the user experience.

The core output of this model is a set of powerful, general-purpose user embeddings. These numerical representations act as a universal feature set, designed to fuel a wide array of downstream applications and eliminate the need for siloed, hand-engineered features for each task.

However, off-the-shelf models could not comprehend our unique data ecosystem—a complex blend of long-term tabular profiles and short-term sequential interactions. This forced us to develop a custom transformer architecture that unifies these diverse data types using a novel key-value tokenization strategy and modality-specific adapters.

Training this model on terabytes of data and generating millions of embeddings daily presented a significant scaling challenge, especially for a small team. We overcame this by leveraging ray to build an efficient, distributed computing pipeline.

Today, our pre-trained embeddings are a critical component for systems across the company. They are actively powering a wide range of applications, including Churn Prediction, Ads Optimization, and Dual App Detection, creating millions of fresh embeddings daily for our users, merchants, and drivers.

At Grab, Southeast Asia's leading super app, a single user journey is a rich, multi-modal story. To understand our users holistically, we needed to learn from the complex web of their interactions across our diverse services. Our goal was to build a powerful user embedding foundation model that could capture this complete view, enhancing numerous downstream models and personalizing the user experience.

The core output of this model is a set of powerful, general-purpose user embeddings. These numerical representations act as a universal feature set, designed to fuel a wide array of downstream applications and eliminate the need for siloed, hand-engineered features for each task.

However, off-the-shelf models could not comprehend our unique data ecosystem—a complex blend of long-term tabular profiles and short-term sequential interactions. This forced us to develop a custom transformer architecture that unifies these diverse data types using a novel key-value tokenization strategy and modality-specific adapters.

Training this model on terabytes of data and generating millions of embeddings daily presented a significant scaling challenge, especially for a small team. We overcame this by leveraging ray to build an efficient, distributed computing pipeline.

Today, our pre-trained embeddings are a critical component for systems across the company. They are actively powering a wide range of applications, including Churn Prediction, Ads Optimization, and Dual App Detection, creating millions of fresh embeddings daily for our users, merchants, and drivers.

Machine Learning

Media & Gaming

Text / Docs

Image

JIT-Embedding with Ray Serve: Accelerating Large-Scale GenAI Foundation Model Training in Adobe Firefly

Feng WangSenior Software EngineerAdobe

Steven LiuSenior Machine Learning EngineerAdobe

Baqiao LiuMachine Learning EngineerAdobe

Haoran CaiTech Lead ManagerAdobe

This presentation introduces JIT-Embedding (Just-in-Time Embedding), a novel solution designed to accelerate the training of foundational Generative AI (GenAI) models, with a focus on image and video generation in Adobe Firefly. By decoupling the expensive embedding computation from model training, JIT-Embedding enables these processes to scale independently. Built on Ray Serve, our architecture includes a robust JIT Service and JIT Client, seamlessly integrated with our Model Hub and Dataloader. The experiment results demonstrate that this approach significantly improved scalability, enabled higher-resolution and larger-scale GenAI Foundation Model training, and achieved notable performance gains and cost reductions. It's one of the innovations contributing to Firefly Video Model public release.

JIT-Embedding addresses several key challenges in large-scale foundation diffusion model training:

1. Slow on-the-fly embedding computation during training (e.g., VAE, CLIP, and T5 embeddings).

2. Long turnaround time required for offline embedding pre-computation.

3. High cost associated with recomputing embeddings using either approach.

4. Severe GPU memory constraints when training large models or processing high-resolution images/videos.

Our solution introduces several innovations to mitigate these issues:

1. JIT Service via Ray Serve: Wraps embedding computation as an on-demand service, deployable on underutilized lower-tier GPUs (e.g. A100), freeing up high-end GPUs (H100) for model training and optimizing resource allocation. The GPU memory requirements drop significantly for both side.

2. JIT Client with Dataloader Integration: Uses multiprocessing and prefetching to overlap embedding requests with training, effectively hiding latency of on-the-fly embedding computation and maximizing GPU utilization.

3. Efficient Serialization/Deserialization: We created a Rust + Python library, inspired by functional programming, to efficiently compresses multimodal data (e.g., images, videos, long text) to improve server–client communication throughput and flexibility.

4. Advanced Performance Optimization: Combines Ray Serve’s dashboards with our custom metrics, profiling, and load testing tools. We leverage advanced Ray features such as autoscaling, dynamic batching, and in-place model updates. Moreover, we came up with some key optimizations include client-side load balancing, faster video/image codecs in rust, overlapping CPU/GPU ops, and shared GPU usage across multiple models.

5. JIT Cache: Automatically stores computed embeddings for reuse across future training jobs, further reducing cost and computation time.

We plan to open source the JIT-Embedding solution, including the services, clients and Serialization/Deserialization library.

This talk will provide a comprehensive overview of the JIT-Embedding architecture, including the design of the JIT Service, JIT Client, Serialization/Deserialization, and the caching mechanism. We will present end-to-end experimental results from large-scale model training, showcasing the system’s scalability, performance enhancements, and cost efficiency. The session will conclude with key takeaways from our journey with Ray Serve and future directions for continued optimization.

This presentation introduces JIT-Embedding (Just-in-Time Embedding), a novel solution designed to accelerate the training of foundational Generative AI (GenAI) models, with a focus on image and video generation in Adobe Firefly. By decoupling the expensive embedding computation from model training, JIT-Embedding enables these processes to scale independently. Built on Ray Serve, our architecture includes a robust JIT Service and JIT Client, seamlessly integrated with our Model Hub and Dataloader. The experiment results demonstrate that this approach significantly improved scalability, enabled higher-resolution and larger-scale GenAI Foundation Model training, and achieved notable performance gains and cost reductions. It's one of the innovations contributing to Firefly Video Model public release.

JIT-Embedding addresses several key challenges in large-scale foundation diffusion model training:

1. Slow on-the-fly embedding computation during training (e.g., VAE, CLIP, and T5 embeddings).

2. Long turnaround time required for offline embedding pre-computation.

3. High cost associated with recomputing embeddings using either approach.

4. Severe GPU memory constraints when training large models or processing high-resolution images/videos.

Our solution introduces several innovations to mitigate these issues:

1. JIT Service via Ray Serve: Wraps embedding computation as an on-demand service, deployable on underutilized lower-tier GPUs (e.g. A100), freeing up high-end GPUs (H100) for model training and optimizing resource allocation. The GPU memory requirements drop significantly for both side.

2. JIT Client with Dataloader Integration: Uses multiprocessing and prefetching to overlap embedding requests with training, effectively hiding latency of on-the-fly embedding computation and maximizing GPU utilization.

3. Efficient Serialization/Deserialization: We created a Rust + Python library, inspired by functional programming, to efficiently compresses multimodal data (e.g., images, videos, long text) to improve server–client communication throughput and flexibility.

4. Advanced Performance Optimization: Combines Ray Serve’s dashboards with our custom metrics, profiling, and load testing tools. We leverage advanced Ray features such as autoscaling, dynamic batching, and in-place model updates. Moreover, we came up with some key optimizations include client-side load balancing, faster video/image codecs in rust, overlapping CPU/GPU ops, and shared GPU usage across multiple models.

5. JIT Cache: Automatically stores computed embeddings for reuse across future training jobs, further reducing cost and computation time.

We plan to open source the JIT-Embedding solution, including the services, clients and Serialization/Deserialization library.

This talk will provide a comprehensive overview of the JIT-Embedding architecture, including the design of the JIT Service, JIT Client, Serialization/Deserialization, and the caching mechanism. We will present end-to-end experimental results from large-scale model training, showcasing the system’s scalability, performance enhancements, and cost efficiency. The session will conclude with key takeaways from our journey with Ray Serve and future directions for continued optimization.

Lightning Talk

Text / Docs

Powering the Future of LLMs: AWS and the vLLM open source project

Phi NguyenGenAI specialistAWS

Amazon is a strong supporter and contributor to vLLM, the leading open source inference engine for serving LLM. vLLM is used across Amazon and enables millions of customers to use the Amazon Rufus shopping assistant. vLLM's support for heterogeneous hardware, including AWS Trainium and NVIDIA GPUs, has enabled deployment of a cost-optimized, multi-node inference architecture. This hybrid approach allows us to route requests to the most appropriate accelerator, leading to infrastructure cost savings without compromising performance. In this session, we'll dive into AWS deployment options with vLLM, our existing open source work streams, and other initiatives.

Amazon is a strong supporter and contributor to vLLM, the leading open source inference engine for serving LLM. vLLM is used across Amazon and enables millions of customers to use the Amazon Rufus shopping assistant. vLLM's support for heterogeneous hardware, including AWS Trainium and NVIDIA GPUs, has enabled deployment of a cost-optimized, multi-node inference architecture. This hybrid approach allows us to route requests to the most appropriate accelerator, leading to infrastructure cost savings without compromising performance. In this session, we'll dive into AWS deployment options with vLLM, our existing open source work streams, and other initiatives.

vLLM

Text / Docs

Scaling LLM Inference with RayServe & vLLM: Building a Serverless Internal, Enterprise Model Hosting Platform

Deepak ChandramouliML InfraApple

Rehan DurraniML Infra EngineerApple

In this talk, we'll share how we built an internal, enterprise, serverless model hosting platform using RayServe and vLLM—powering fast, scalable LLM inference across teams. Drawing inspiration from best-in-class industry solutions, our platform empowers users to deploy and manage models through a streamlined, self-service interface. We’ll dive into the key capabilities we’ve layered on top, including challenges and solutions around multi-tenancy, auto-scaling, token-level budgeting, request observability, and fine-grained resource controls. Whether you're building for internal developers or external customers, this session will show how RayServe and vLLM can be combined to deliver reliable, production-grade model inference at scale.

In this talk, we'll share how we built an internal, enterprise, serverless model hosting platform using RayServe and vLLM—powering fast, scalable LLM inference across teams. Drawing inspiration from best-in-class industry solutions, our platform empowers users to deploy and manage models through a streamlined, self-service interface. We’ll dive into the key capabilities we’ve layered on top, including challenges and solutions around multi-tenancy, auto-scaling, token-level budgeting, request observability, and fine-grained resource controls. Whether you're building for internal developers or external customers, this session will show how RayServe and vLLM can be combined to deliver reliable, production-grade model inference at scale.

Ray Deep Dives

Ray Direct Transport: Direct GPU-GPU communication in Ray Core

Stephanie WangAnyscale

Qiaolin YuAnyscale

GPU workloads on Ray often hit a hidden bottleneck: every tensor passed between tasks takes a costly trip through CPU memory and serialization. Ray Direct Transport (RDT) is a new feature in Ray Core that eliminates this overhead by keeping GPU data on the device and transferring it directly between actors—no unnecessary copies, no serialization.
Powered by high-performance backends like NCCL, Gloo, and RDMA, RDT enables easy and efficient scaling of cutting-edge workloads like reinforcement learning for LLMs and disaggregated multimodal training. In this talk, we’ll show how RDT integrates seamlessly with the familiar Ray ObjectRef API, the architecture behind RDT, and demonstrate how it unlocks fast and flexible distributed GPU programming.

GPU workloads on Ray often hit a hidden bottleneck: every tensor passed between tasks takes a costly trip through CPU memory and serialization. Ray Direct Transport (RDT) is a new feature in Ray Core that eliminates this overhead by keeping GPU data on the device and transferring it directly between actors—no unnecessary copies, no serialization.
Powered by high-performance backends like NCCL, Gloo, and RDMA, RDT enables easy and efficient scaling of cutting-edge workloads like reinforcement learning for LLMs and disaggregated multimodal training. In this talk, we’ll show how RDT integrates seamlessly with the familiar Ray ObjectRef API, the architecture behind RDT, and demonstrate how it unlocks fast and flexible distributed GPU programming.

Lightning Talk

Text / Docs

Structured Data

Synthetic data generation with ray data + serve + vLLM

Arthur BookMachine Learning EngineerLiquid AI

This talk covers design patterns and considerations for combining ray data + serve + vLLM to construct scalable, high throughput pipelines for synthetic data generation. As an illustrative example, we implement a two-agent self-refinement loop using ray.serve + vLLM and integrate it into a ray.data pipeline.

This talk covers design patterns and considerations for combining ray data + serve + vLLM to construct scalable, high throughput pipelines for synthetic data generation. As an illustrative example, we implement a two-agent self-refinement loop using ray.serve + vLLM and integrate it into a ray.data pipeline.

Machine Learning

Media & Gaming

Structured Data

Mako: Netflix's Next Generation ML Training Platform

Avin RegmiEngineering ManagerNetflix

Matan AppelbaumSenior Software EngineerNetflix

At Netflix, we are building Mako, a new ML training platform designed to meet the demands of modern AI workloads. In this talk, we will share how we evolved our training platform, improved GPU efficiency using a custom scheduler, and made key architecture changes to support large-scale training. We will also cover how Ray fits into this journey and what we learned along the way.

At Netflix, we are building Mako, a new ML training platform designed to meet the demands of modern AI workloads. In this talk, we will share how we evolved our training platform, improved GPU efficiency using a custom scheduler, and made key architecture changes to support large-scale training. We will also cover how Ray fits into this journey and what we learned along the way.

Machine Learning

Physical AI

Video

Optimizing Video AI at Scale: Cost-Effective ML Operations with Geotab and Anyscale Ray

Mengliao WangAI Platform ManagerGeotab

Processing and deriving intelligence from billions of frames of video data captured by Geotab cameras can be a resource-intensive task. This presentation will share Geotab's journey of building a cost-efficient and highly automated Smart Video Platform utilizing Anyscale Ray.

We will showcase how Ray serves as the backbone for hosting and orchestrating our machine learning models for video analysis, enabling both efficient real-time inference and batch processing.

A key focus will be on our automated training and validation workflows, which leverage Ray's distributed capabilities to dramatically reduce the time and cost associated with model development and deployment. Learn how Geotab is achieving significant operational savings and accelerating innovation in video analytics through a strategic embrace of Anyscale Ray.

Processing and deriving intelligence from billions of frames of video data captured by Geotab cameras can be a resource-intensive task. This presentation will share Geotab's journey of building a cost-efficient and highly automated Smart Video Platform utilizing Anyscale Ray.

We will showcase how Ray serves as the backbone for hosting and orchestrating our machine learning models for video analysis, enabling both efficient real-time inference and batch processing.

A key focus will be on our automated training and validation workflows, which leverage Ray's distributed capabilities to dramatically reduce the time and cost associated with model development and deployment. Learn how Geotab is achieving significant operational savings and accelerating innovation in video analytics through a strategic embrace of Anyscale Ray.

Lightning Talk

Structured Data

How Ray Is Powering The Open-source Data Ecosystem

Enrico ShippoleTeraflop.ai

Need to process 99% of US precedential caselaw in under $1? Scale distributed video processing to hundreds of terabytes of Creative Commons videos? Seamlessly churn through petabytes of text and images? Learn more about how Ray and Anyscale are the foundation of the open-source data engineering ecosystem.

Need to process 99% of US precedential caselaw in under $1? Scale distributed video processing to hundreds of terabytes of Creative Commons videos? Seamlessly churn through petabytes of text and images? Learn more about how Ray and Anyscale are the foundation of the open-source data engineering ecosystem.

Lightning Talk

Text / Docs

vLLM with the Transformers backend: One model definition to rule them all

Harry MellorMachine Learning EngineerHugging Face

What if you could use the same model implementation for training and inference?

With the Transformers backend for vLLM, that’s now a reality. vLLM can run any Transformers-compatible model (merged into Transformers or custom!) directly from its original definition at vLLM speeds! Your modeling code becomes the single source of truth, while vLLM handles high-performance inference, attention optimizations, and scaling.

In this session we’ll take a whirlwind tour of: what the Transformers backend is, how to make a model compatible, and how this enables teams to use the same modeling codebase for research and production.

What if you could use the same model implementation for training and inference?

With the Transformers backend for vLLM, that’s now a reality. vLLM can run any Transformers-compatible model (merged into Transformers or custom!) directly from its original definition at vLLM speeds! Your modeling code becomes the single source of truth, while vLLM handles high-performance inference, attention optimizations, and scaling.

In this session we’ll take a whirlwind tour of: what the Transformers backend is, how to make a model compatible, and how this enables teams to use the same modeling codebase for research and production.

Machine Learning

Finance

Structured Data

Building a Model Fitting Framework for Quant Finance with Ray & Anyscale

Todd GauglerQuantitative DeveloperPoint72

Quant trading and research teams at Point72/Cubist have diverse needs related to data, models, and their specific use cases. Investing in an on-premise Ray cluster enabled Ray-focused approaches, but adoption has not always been seamless. Challenges emerged around data management (loading, reuse, access), scaling (efficiently performing parallel windowed model training, sometimes on tens of terrabytes of timeseries data), and platform usage (determining how and when to utilize an Anyscale cluster).

The Cubist Core Research Technologies team has developed libraries to tackle these challenges. We'll discuss patterns we found to be successful, those that fell short, and share insights that we believe can be widely applied.

Quant trading and research teams at Point72/Cubist have diverse needs related to data, models, and their specific use cases. Investing in an on-premise Ray cluster enabled Ray-focused approaches, but adoption has not always been seamless. Challenges emerged around data management (loading, reuse, access), scaling (efficiently performing parallel windowed model training, sometimes on tens of terrabytes of timeseries data), and platform usage (determining how and when to utilize an Anyscale cluster).

The Cubist Core Research Technologies team has developed libraries to tackle these challenges. We'll discuss patterns we found to be successful, those that fell short, and share insights that we believe can be widely applied.

Machine Learning

Image

Video

Scaling Image and Video Processing with Ray

Zhibei MaMember of Technical StuffXAI

Scaling Image and Video Processing with Ray

Scaling Image and Video Processing with Ray

Engineering Lessons from Scaling Synthetic Data for Trillion-Scale Pretraining with KubeRay + vLLM

Fan PanFounding member of Technical StaffDatology AI

Bodgan GazaCo-Founder & CTODatologyAI

As language models continue to grow in capability and scale, high-quality training data has become a critical bottleneck. Synthetic data generation has emerged as a core technique for creating diverse, targeted datasets that complement organic sources and power models from lightning-fast 4.5B-parameter systems to frontier models like GPT-5.

We will share engineering lessons from building and scaling a production platform that processes trillions of tokens using KubeRay and vLLM, dynamically orchestrating thousands of GPU workers across multimodal recaptioning, rephrasing, and domain-specific content generation tasks. Topics include pushing vLLM inference to near-peak GPU utilization, designing fault-tolerant Ray actors for tensor-parallel sharding, auto-scaling KubeRay clusters to match workload patterns, and applying storage and scheduling strategies that deliver both high performance and significant cost efficiency.

We will highlight practical patterns for building resilient, scalable ML infrastructure. Attendees will learn how clean abstractions between Ray's distributed computing layer and vLLM's inference engine enable rapid iteration on prompt engineering while maintaining production stability, accelerating the journey from research prototypes to trillion-token datasets that define the next generation of AI capabilities.

As language models continue to grow in capability and scale, high-quality training data has become a critical bottleneck. Synthetic data generation has emerged as a core technique for creating diverse, targeted datasets that complement organic sources and power models from lightning-fast 4.5B-parameter systems to frontier models like GPT-5.

We will share engineering lessons from building and scaling a production platform that processes trillions of tokens using KubeRay and vLLM, dynamically orchestrating thousands of GPU workers across multimodal recaptioning, rephrasing, and domain-specific content generation tasks. Topics include pushing vLLM inference to near-peak GPU utilization, designing fault-tolerant Ray actors for tensor-parallel sharding, auto-scaling KubeRay clusters to match workload patterns, and applying storage and scheduling strategies that deliver both high performance and significant cost efficiency.

We will highlight practical patterns for building resilient, scalable ML infrastructure. Attendees will learn how clean abstractions between Ray's distributed computing layer and vLLM's inference engine enable rapid iteration on prompt engineering while maintaining production stability, accelerating the journey from research prototypes to trillion-token datasets that define the next generation of AI capabilities.

Machine Learning

Structured Data

Building RayLab: Autodesk’s Journey to Scalable Deep Learning Infrastructure

Kamal Rahimi MalekshanPrinciple Machine Learning EngineerAutodesk

Guillermo Del CastilloMachine Learning ManagerAutodesk

In this presentation, we describe Autodesk's journey to enabling large-scale deep learning across the company. We began by exploring managed solutions like AWS Batch and SageMaker, but quickly ran into challenges around scalability, customization, networking, and developer experience. To overcome these limitations, we turned to Ray and KubeRay, which offered the flexibility and control we needed. Building on top of these technologies, we developed RayLab - Autodesk's internal platform for scalable training, data processing, and model serving.

RayLab is a Kubernetes-native platform that abstracts away infrastructure complexity while supporting secure, efficient, and user-friendly workflows. We built wrappers for Ray cluster management via a CLI, Web UI, and Python SDK to simplify usage, reduce onboarding friction, and ensure compliance with Autodesk's internal security and networking requirements. We'll describe the architecture behind RayLab, which includes Kubernetes, KubeRay, Karpenter, Grafana, and JupyterHub - all secured with role-based access control and designed for multi-tenancy to support ML workspaces across teams.

RayLab provides high-level APIs built on Ray and PyTorch Lightning, allowing users to launch distributed training jobs with minimal code. It includes standardized checkpointing and experiment tracking to support reproducibility and consistent workflows.

We'll also share the challenges we faced in improving the efficiency of our reserved H100 GPU resources, particularly around fair sharing across teams and users. To address this, we implemented quota management and a priority-based scheduling system that enables high-priority jobs to preempt lower-priority ones, significantly increasing utilization. Additionally, RayLab supports automatic downscaling of underutilized clusters to conserve compute.

Finally, we'll conclude with a live demo of RayLab in action.

In this presentation, we describe Autodesk's journey to enabling large-scale deep learning across the company. We began by exploring managed solutions like AWS Batch and SageMaker, but quickly ran into challenges around scalability, customization, networking, and developer experience. To overcome these limitations, we turned to Ray and KubeRay, which offered the flexibility and control we needed. Building on top of these technologies, we developed RayLab - Autodesk's internal platform for scalable training, data processing, and model serving.

RayLab is a Kubernetes-native platform that abstracts away infrastructure complexity while supporting secure, efficient, and user-friendly workflows. We built wrappers for Ray cluster management via a CLI, Web UI, and Python SDK to simplify usage, reduce onboarding friction, and ensure compliance with Autodesk's internal security and networking requirements. We'll describe the architecture behind RayLab, which includes Kubernetes, KubeRay, Karpenter, Grafana, and JupyterHub - all secured with role-based access control and designed for multi-tenancy to support ML workspaces across teams.

RayLab provides high-level APIs built on Ray and PyTorch Lightning, allowing users to launch distributed training jobs with minimal code. It includes standardized checkpointing and experiment tracking to support reproducibility and consistent workflows.

We'll also share the challenges we faced in improving the efficiency of our reserved H100 GPU resources, particularly around fair sharing across teams and users. To address this, we implemented quota management and a priority-based scheduling system that enables high-priority jobs to preempt lower-priority ones, significantly increasing utilization. Additionally, RayLab supports automatic downscaling of underutilized clusters to conserve compute.

Finally, we'll conclude with a live demo of RayLab in action.

vLLM

Text / Docs

FlashInfer: Accelerating LLM Inference Through Unified High-Performance Kernels

Kaustubh Rao Nividia

TBC

TBC

Ray Deep Dives

Advancing KubeRay: Deepening Ecosystem Integrations for Scalable AI Workloads

Andrew Sy KimSoftware EngineerGoogle

Rueian HuangSoftware Engineering InternAnyscale

Improving user experience has been a central focus of the latest KubeRay developments. In this session, KubeRay maintainers will introduce major RayJob enhancements designed to make running and managing workloads simpler and more reliable, including deletion policies, cron scheduling, sidecar mode, and background status checks. These upgrades streamline the end-to-end job lifecycle for both developers and operators.
We will then explore new capabilities that elevate the KubeRay user experience across the Kubernetes ecosystem. Highlights include updates to the kubectl plugin for smoother user workflows, the redesigned APIServer V2 for a cleaner and more extensible control plane, and KubeRay Metrics to improve observability. We will also cover expanded Kubernetes third party scheduler supports, along with progress on in-place pod resizing integration.
Finally, we will share the roadmap for the upcoming history server, which will enable debugging for dead Ray clusters. Join us to learn how these enhancements are transforming the KubeRay experience—making it more powerful, cohesive, and user-friendly for running AI workloads on Kubernetes at scale.

Improving user experience has been a central focus of the latest KubeRay developments. In this session, KubeRay maintainers will introduce major RayJob enhancements designed to make running and managing workloads simpler and more reliable, including deletion policies, cron scheduling, sidecar mode, and background status checks. These upgrades streamline the end-to-end job lifecycle for both developers and operators.
We will then explore new capabilities that elevate the KubeRay user experience across the Kubernetes ecosystem. Highlights include updates to the kubectl plugin for smoother user workflows, the redesigned APIServer V2 for a cleaner and more extensible control plane, and KubeRay Metrics to improve observability. We will also cover expanded Kubernetes third party scheduler supports, along with progress on in-place pod resizing integration.
Finally, we will share the roadmap for the upcoming history server, which will enable debugging for dead Ray clusters. Join us to learn how these enhancements are transforming the KubeRay experience—making it more powerful, cohesive, and user-friendly for running AI workloads on Kubernetes at scale.

Ray Deep Dives

Structured Data Support for Ray Data

Alexey Kudinkin

Ray Data is a data processing engine built to target AI workloads. In the past year, we've added support for traditional tabular data workloads to Ray Data. In this talk, we'll discuss key new features like joins, expressions, and data preprocessors in Ray Data; discuss new architectural changes needed to support these features; and showcase performance improvements that we see across Ray versions from these changes. We'll end with a discussion of future roadmap and talk about upcoming challenges.

Ray Data is a data processing engine built to target AI workloads. In the past year, we've added support for traditional tabular data workloads to Ray Data. In this talk, we'll discuss key new features like joins, expressions, and data preprocessors in Ray Data; discuss new architectural changes needed to support these features; and showcase performance improvements that we see across Ray versions from these changes. We'll end with a discussion of future roadmap and talk about upcoming challenges.

LLMs

Research

Text / Docs

Marin: Open Development of Open Foundation Models

David HallResearch Engineering LeadStanford University

Open-source software thrives because its entire lifecycle remains public: code, tests, even missteps. Foundation models rarely meet that bar: most “open-weight” releases omit the training code, data recipe, and logs needed for reproducibility.

Marin closes that gap. Every run begins as a GitHub pull request that defines the hypothesis and pins the config. Ray orchestrates the job across preemptible Google Cloud TPUs, streaming metrics and depositing artifacts tied exactly to the commit. Successes, failures, and restarts remain visible, not hidden.

With this workflow we trained Marin-8B, which outperforms Llama 3.1 8B Base on 14 of 19 benchmarks. We will share lessons from scaling to 32B parameters, training MoEs on preemptible hardware, and building a Ray-based RL pipeline for agentic models, focusing on autoscaling, fault tolerance, and dataset pipelines. We will also highlight ways to get involved, from optimizers and data curation to following training logs live.

Open-source software thrives because its entire lifecycle remains public: code, tests, even missteps. Foundation models rarely meet that bar: most “open-weight” releases omit the training code, data recipe, and logs needed for reproducibility.

Marin closes that gap. Every run begins as a GitHub pull request that defines the hypothesis and pins the config. Ray orchestrates the job across preemptible Google Cloud TPUs, streaming metrics and depositing artifacts tied exactly to the commit. Successes, failures, and restarts remain visible, not hidden.

With this workflow we trained Marin-8B, which outperforms Llama 3.1 8B Base on 14 of 19 benchmarks. We will share lessons from scaling to 32B parameters, training MoEs on preemptible hardware, and building a Ray-based RL pipeline for agentic models, focusing on autoscaling, fault tolerance, and dataset pipelines. We will also highlight ways to get involved, from optimizers and data curation to following training logs live.

LLMs

Finance

Text / Docs

Structured Data

Operationalizing Ray for Real-Time Inference at Scale

Ankur MohanExecutive DirectorJPMorganChase

Dhayaneshwar BalusamyApplied AI ML Associate SrJPMorganChase

Siddartha TondapuSr Lead Software Engineer (Vice President)JPMorganChase

The KubeRay project simplifies the deployment of Ray clusters on Kubernetes. However, hardening these clusters to meet stringent production non-functional requirements (NFRs) typical of regulated environments such as those in finance requires additional engineering effort. These NFRs include multi-region and multi-AZ resilience, deployments with zero downtime, proactive autoscaling, automated production validation testing, and robust integration with monitoring and logging systems. Achieving this requires strict adherence to integration patterns and architectural discipline. This work outlines the engineering patterns and platform enhancements we applied to deploy a real-time recommendation system on a Ray cluster on EKS in production.

The system architecture routes traffic through Route53 to NLB, ALB, and finally nginx ingress to reach Ray Serve deployments running in isolated Kubernetes namespaces. Ray Serve’s proxy actors distributed across worker pods act as distributed ingress routers for balancing traffic across replicas, buffering during surges, and enabling concurrent processing without blocking. Inference workloads run on dedicated Ray clusters on separate Kubernetes namespaces managed by the centralized KubeRay Operator. Compute and network isolation is ensured by network policies and RBAC.

Autoscaling is handled at three levels: Ray Serve scales replicas based on request queue depth, KubeRay Autoscaler adjusts Ray worker pod counts based on cluster metrics, and the AWS Cluster Autoscaler provisions EC2 instances based on pending pods awaiting compute resources. This ensures responsiveness during traffic spikes while avoiding over-provisioning.

To maximize GPU utilization and reduce latency, the platform leverages Ray Serve’s dynamic batching in combination with vLLM to perform batched LLM inference. This approach ensures high-throughput, low-latency processing, especially under variable request loads, by grouping requests at runtime based on traffic characteristics.

Observability is achieved through an integrated Prometheus and Grafana stack. A PodMonitor scrapes metrics from Ray components, which are then ingested by Prometheus and visualized in Grafana for real-time analysis and alerting. In parallel, a Fluentd DaemonSet captures logs from Ray and application pods, forwarding them to AWS CloudWatch. These logs are then ingested into Splunk for centralized search, monitoring, and audit compliance. The apps are also monitored at cluster level by Dynatrace and Datadog to further enhance observability and monitoring capabilities.

To enable robust and disruption-free deployments, the platform uses a blue-green deployment pipeline built with Spinnaker. The pipeline includes progressive rollout stages, automated validation, manual approval gates, and rollback paths.

This robust system demonstrates a scalable, resilient, and observability-driven approach to deploying real-time inference and LLM workloads on Ray. Attendees will gain valuable insights into end-to-end development and deployment architecture, GPU workload optimization, and operationalization of Ray Serve in production environments.

The KubeRay project simplifies the deployment of Ray clusters on Kubernetes. However, hardening these clusters to meet stringent production non-functional requirements (NFRs) typical of regulated environments such as those in finance requires additional engineering effort. These NFRs include multi-region and multi-AZ resilience, deployments with zero downtime, proactive autoscaling, automated production validation testing, and robust integration with monitoring and logging systems. Achieving this requires strict adherence to integration patterns and architectural discipline. This work outlines the engineering patterns and platform enhancements we applied to deploy a real-time recommendation system on a Ray cluster on EKS in production.

The system architecture routes traffic through Route53 to NLB, ALB, and finally nginx ingress to reach Ray Serve deployments running in isolated Kubernetes namespaces. Ray Serve’s proxy actors distributed across worker pods act as distributed ingress routers for balancing traffic across replicas, buffering during surges, and enabling concurrent processing without blocking. Inference workloads run on dedicated Ray clusters on separate Kubernetes namespaces managed by the centralized KubeRay Operator. Compute and network isolation is ensured by network policies and RBAC.

Autoscaling is handled at three levels: Ray Serve scales replicas based on request queue depth, KubeRay Autoscaler adjusts Ray worker pod counts based on cluster metrics, and the AWS Cluster Autoscaler provisions EC2 instances based on pending pods awaiting compute resources. This ensures responsiveness during traffic spikes while avoiding over-provisioning.

To maximize GPU utilization and reduce latency, the platform leverages Ray Serve’s dynamic batching in combination with vLLM to perform batched LLM inference. This approach ensures high-throughput, low-latency processing, especially under variable request loads, by grouping requests at runtime based on traffic characteristics.

Observability is achieved through an integrated Prometheus and Grafana stack. A PodMonitor scrapes metrics from Ray components, which are then ingested by Prometheus and visualized in Grafana for real-time analysis and alerting. In parallel, a Fluentd DaemonSet captures logs from Ray and application pods, forwarding them to AWS CloudWatch. These logs are then ingested into Splunk for centralized search, monitoring, and audit compliance. The apps are also monitored at cluster level by Dynatrace and Datadog to further enhance observability and monitoring capabilities.

To enable robust and disruption-free deployments, the platform uses a blue-green deployment pipeline built with Spinnaker. The pipeline includes progressive rollout stages, automated validation, manual approval gates, and rollback paths.

This robust system demonstrates a scalable, resilient, and observability-driven approach to deploying real-time inference and LLM workloads on Ray. Attendees will gain valuable insights into end-to-end development and deployment architecture, GPU workload optimization, and operationalization of Ray Serve in production environments.

CoServe: Max performance, minimal compute

Chen XiaMember of Technical StaffCohere

Cohere is committed to build a scalable and efficient all-in-one platform for private and secure AI solutions for enterprises. On top of the vLLM library, we combine accuracy-preserving low-bit quantization solutions with extensive kernel and data communication optimization to deliver high-performance low-latency inference services with minimal compute cost. For example, our foundation model Command A series can be served with single h100 GPU at low latency while supporting more than 128K context length.

Cohere is committed to build a scalable and efficient all-in-one platform for private and secure AI solutions for enterprises. On top of the vLLM library, we combine accuracy-preserving low-bit quantization solutions with extensive kernel and data communication optimization to deliver high-performance low-latency inference services with minimal compute cost. For example, our foundation model Command A series can be served with single h100 GPU at low latency while supporting more than 128K context length.

Machine Learning

Text / Docs

Structured Data

Revolutionizing Model Serving with a 50x Cost Reduction using Ray Serve at Workday

Josh KarpelPrincipal Machine Learning EngineerWorkday

Workday uses a tenanted, regionalized architecture in order to ensure data isolation and in-region execution, both of which are crucial requirements for our customers. In early 2023, facing challenges with the ever-increasing scale and cost required to serve dedicated ML models for every tenant in every environment, we decided to completely redo how we serve models using a hot new technology: Ray! We now use Ray Serve to serve tens of thousands of ML models across more than a dozen environments. Ray Serve’s inherent capabilities of per-deployment autoscaling and efficient request routing have enabled 50x cost reductions compared to our previous systems while maintaining high availability and low latency.

But it wasn’t all smooth sailing: Ray Serve was not originally designed for some of the odd ways our system works, and we ran into scaling problems with only a few dozen applications deployed. Thankfully, Ray is open-source, and we were able to dramatically improve the scalability of Ray Serve through a series of contributions we made back to the community. Today, Ray Serve can handle thousands of applications per cluster.

In this talk, we’ll discuss how we use Ray Serve to implement model serving at Workday. We’ll dive deep into our off-the-beaten-path usage of Ray Serve, the challenges we faced with scaling up our system, and the improvements we’ve contributed back to Ray to solve them. You’ll come away from this talk with a deeper understanding of how Ray Serve works and how to build complex systems on top of it – and maybe even contribute back to it!

Workday uses a tenanted, regionalized architecture in order to ensure data isolation and in-region execution, both of which are crucial requirements for our customers. In early 2023, facing challenges with the ever-increasing scale and cost required to serve dedicated ML models for every tenant in every environment, we decided to completely redo how we serve models using a hot new technology: Ray! We now use Ray Serve to serve tens of thousands of ML models across more than a dozen environments. Ray Serve’s inherent capabilities of per-deployment autoscaling and efficient request routing have enabled 50x cost reductions compared to our previous systems while maintaining high availability and low latency.

But it wasn’t all smooth sailing: Ray Serve was not originally designed for some of the odd ways our system works, and we ran into scaling problems with only a few dozen applications deployed. Thankfully, Ray is open-source, and we were able to dramatically improve the scalability of Ray Serve through a series of contributions we made back to the community. Today, Ray Serve can handle thousands of applications per cluster.

In this talk, we’ll discuss how we use Ray Serve to implement model serving at Workday. We’ll dive deep into our off-the-beaten-path usage of Ray Serve, the challenges we faced with scaling up our system, and the improvements we’ve contributed back to Ray to solve them. You’ll come away from this talk with a deeper understanding of how Ray Serve works and how to build complex systems on top of it – and maybe even contribute back to it!

Machine Learning

Text / Docs

Image

Structured Data

Accelerating AI Pipelines with AnalyticDB Ray: Alibaba Cloud's Approach to Data-AI Convergence

Fei XueProduct ManagerAlibaba Cloud

Liang LinSenior Director of EngineeringAlibaba Cloud

Abstract: In the era of data-driven innovation, efficiently processing and analyzing multi-modal data is crucial for building effective AI pipelines. This presentation will focus on real-world applications of Alibaba Cloud's AnalyticDB Ray, showcasing how its capabilities are leveraged within a data warehouse environment to accelerate AI initiatives.

We will delve into practical use cases that demonstrate the seamless integration of multi-modal ETL and machine learning:

1. Optimizing Advertising Recommendation Inference: Learn how AnalyticDB Ray is used for offline batch inference in advertising recommendations, specifically for estimating click-through rates (CTR). This includes details on how heterogeneous resources (CPU and GPU) are independently and automatically scaled to maximize GPU utilization, achieving an increase from less than 5% to 40%. We will also discuss the dynamic auto-scaling of object storage based on data volume, which has improved data processing performance by 2 to 3 times.

2. Accelerating Large Language Model (LLM) Offline Batch Inference and Data Distillation: Discover how AnalyticDB Ray facilitates large model data preparation. We will illustrate the use of Ray Data and vLLM/SGLang for data distillation with models like Qwen and Deepseek, which then fuels large model training. Key benefits include a 2-3x improvement in data loading throughput due to caching , scheduling of 40,000 fine-grained tasks within a single Ray cluster , and a 50% performance increase for Deepseek INT8 quantization compared to FP8 in offline distillation scenarios.

3. Efficient Distributed Fine-tuning of Multi-modal Models: Explore how AnalyticDB Ray, integrated with Lance, enhances distributed image-text data processing and structuring using RayData for multi-modal personalized interactive scenarios. We will also showcase the integration with LLaMA-Factory to provide distributed fine-tuning capabilities for Qwen-VL multi-modal models. This offers a one-stop solution from data labeling to model fine-tuning and has improved distributed fine-tuning efficiency by 3-5 times.

These examples will illustrate how AnalyticDB Ray unlocks the potential for in-warehouse AI pipelines, seamlessly integrating multi-modal ETL and machine learning to accelerate the journey from data to intelligent decision-making.

Abstract: In the era of data-driven innovation, efficiently processing and analyzing multi-modal data is crucial for building effective AI pipelines. This presentation will focus on real-world applications of Alibaba Cloud's AnalyticDB Ray, showcasing how its capabilities are leveraged within a data warehouse environment to accelerate AI initiatives.

We will delve into practical use cases that demonstrate the seamless integration of multi-modal ETL and machine learning:

1. Optimizing Advertising Recommendation Inference: Learn how AnalyticDB Ray is used for offline batch inference in advertising recommendations, specifically for estimating click-through rates (CTR). This includes details on how heterogeneous resources (CPU and GPU) are independently and automatically scaled to maximize GPU utilization, achieving an increase from less than 5% to 40%. We will also discuss the dynamic auto-scaling of object storage based on data volume, which has improved data processing performance by 2 to 3 times.

2. Accelerating Large Language Model (LLM) Offline Batch Inference and Data Distillation: Discover how AnalyticDB Ray facilitates large model data preparation. We will illustrate the use of Ray Data and vLLM/SGLang for data distillation with models like Qwen and Deepseek, which then fuels large model training. Key benefits include a 2-3x improvement in data loading throughput due to caching , scheduling of 40,000 fine-grained tasks within a single Ray cluster , and a 50% performance increase for Deepseek INT8 quantization compared to FP8 in offline distillation scenarios.

3. Efficient Distributed Fine-tuning of Multi-modal Models: Explore how AnalyticDB Ray, integrated with Lance, enhances distributed image-text data processing and structuring using RayData for multi-modal personalized interactive scenarios. We will also showcase the integration with LLaMA-Factory to provide distributed fine-tuning capabilities for Qwen-VL multi-modal models. This offers a one-stop solution from data labeling to model fine-tuning and has improved distributed fine-tuning efficiency by 3-5 times.

These examples will illustrate how AnalyticDB Ray unlocks the potential for in-warehouse AI pipelines, seamlessly integrating multi-modal ETL and machine learning to accelerate the journey from data to intelligent decision-making.

IN-PERSON AGENDA

Separations of Concerns in Agentic Reinforcement Learning

Leveraging Ray, vLLM and LiteLLM built a trusted LLM service for sensitive data

Parallelizing Searches over Agentic Pipelines with Ray and syftr

Ray at Applied Intuition: RL and Inference

End‑to‑End Hybrid Reinforcement and Imitation Learning for Robotics with Ray

A foundation model for each enterprise

Optimizing GPU Utilization: Ray Train’s Distributed Solutions for Removing Training Bottlenecks

Ray Serve: Advancing scalability and flexibility

Lunch + Networking

State of vLLM 2025

Horizontal, Predictable, High-Throughput Inference for Synthetic Data Generation, Evals, and More

Improved Scheduling Flexibility with Label Selectors in Ray

Exabyte-scale Streaming Iceberg IO with Ray, Flink, and DeltaCAT

Scalable High-Performance Multi-Modal Data Curation on Ray

Scaling LLM Post-Training at Character.AI

Breakfast + Networking

From S3 Bottlenecks to Scalable I/O: Evolving Ray Pipelines with Alluxio

Ray Summit Celebration Happy Hour

Matrix: reliable framework for data-centric experimentation at scale

Breaking the Dataset Iteration Bottleneck: Real-Time ML Experimentation with Ray

Super App, Super Powers: How a Small Team Built Grab's Multi-Modal Foundation Model with Ray

JIT-Embedding with Ray Serve: Accelerating Large-Scale GenAI Foundation Model Training in Adobe Firefly

Powering the Future of LLMs: AWS and the vLLM open source project

Scaling LLM Inference with RayServe & vLLM: Building a Serverless Internal, Enterprise Model Hosting Platform

Ray Direct Transport: Direct GPU-GPU communication in Ray Core

Synthetic data generation with ray data + serve + vLLM

Mako: Netflix's Next Generation ML Training Platform

Optimizing Video AI at Scale: Cost-Effective ML Operations with Geotab and Anyscale Ray

How Ray Is Powering The Open-source Data Ecosystem

vLLM with the Transformers backend: One model definition to rule them all

Building a Model Fitting Framework for Quant Finance with Ray & Anyscale

Scaling Image and Video Processing with Ray

Engineering Lessons from Scaling Synthetic Data for Trillion-Scale Pretraining with KubeRay + vLLM

Building RayLab: Autodesk’s Journey to Scalable Deep Learning Infrastructure

FlashInfer: Accelerating LLM Inference Through Unified High-Performance Kernels

Advancing KubeRay: Deepening Ecosystem Integrations for Scalable AI Workloads

Structured Data Support for Ray Data

Marin: Open Development of Open Foundation Models

Operationalizing Ray for Real-Time Inference at Scale

CoServe: Max performance, minimal compute

Revolutionizing Model Serving with a 50x Cost Reduction using Ray Serve at Workday

Accelerating AI Pipelines with AnalyticDB Ray: Alibaba Cloud's Approach to Data-AI Convergence