RAY SUMMIT 2025
IN-PERSON AGENDA
Reinforcement Learning
Separations of Concerns in Agentic Reinforcement Learning
This talk surveys a number of design choices and tradeoffs that arise in building performant distributed training infrastructure for agentic reinforcement learning, and offers an opinionated perspective on how modern research efforts require rethinking standard abstractions and separations of concerns. Using verifiers as a case study, we argue that much sharper boundaries should be drawn between trainers and environments than is often observed in popular libraries. In particular, we propose that RL trainers ought to adopt OpenAI Chat Completions as a universal inference spec, and that the role of the actor (from the trainer's perspective) should simply be to expose a generic inference endpoint to be used within environments. Further, we argue that orchestration of requests within rollouts (e.g. inference, code execution) and orchestration of requests within training (e.g. weight updates) should be entirely decoupled, and that trainers should not be aware of any environment details modulo high-level spec features (e.g. whether multimodality or branching rollouts are required). We view both of these requirements as crucial in order to 1) source large volumes of high-quality train-ready environments, 2) allow environments and trainers to be largely model-agnostic, and 3) ensure that trained models can be directly used in applications of interest without friction.
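For illustration, a minimal sketch of the proposed boundary, assuming an OpenAI-compatible endpoint (e.g. one served by vLLM) and the official openai Python client; the endpoint URL, model name, and toy verifier below are placeholders, not part of the verifiers library:

    from openai import OpenAI

    # The trainer only exposes an OpenAI-compatible inference endpoint (placeholder URL);
    # the environment never touches trainer internals, only Chat Completions.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    def toy_verifier(completion: str) -> float:
        # Placeholder reward; real environments implement task-specific verification.
        return float("because" in completion.lower())

    def rollout(prompt: str, model: str = "my-policy") -> dict:
        # One environment episode, expressed purely against the generic endpoint.
        messages = [{"role": "user", "content": prompt}]
        response = client.chat.completions.create(model=model, messages=messages)
        completion = response.choices[0].message.content
        return {"messages": messages, "completion": completion, "reward": toy_verifier(completion)}

Because the environment depends only on the Chat Completions spec, the same rollout code can be pointed at a training actor or at a production endpoint without changes.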
Lightning Talk
Finance
Text / Docs
Leveraging Ray, vLLM, and LiteLLM to build a trusted LLM service for sensitive data
In H1 2025, the Coinbase MLP team built a trusted LLM service by leveraging Ray, vLLM, and LiteLLM, reinforcing Coinbase's position as the most trusted crypto exchange.
In this talk, we will walk through the technical details of user authentication, service-to-service (s2s) communication, LiteLLM distribution, vLLM, and Ray to tell the whole story of how Coinbase uses Ray and vLLM to build its LLM serving API and support internal LLM traffic.
Lightning Talk
Text / Docs
Structured Data
Parallelizing Searches over Agentic Pipelines with Ray and syftr
Agentic pipelines are central to applying large language models (LLMs) to proprietary or dynamic data. However, constructing efficient agentic flows presents significant challenges. It necessitates precise selection among various components, including vector databases, embedding models, text splitters, retrievers, and synthesizing LLMs. Further complicating this process is the meticulous tuning required for modules such as verifiers, rewriters, and rerankers, each with their own intricate hyperparameter dependencies. In performance-sensitive applications, manually balancing the tradeoffs between latency, accuracy, and cost becomes progressively more difficult.
We introduce syftr, a framework that performs efficient, distributed, multi-objective search over a vast (~10^23) space of agentic and non-agentic configurations. Using advances in Bayesian Optimization, syftr discovers Pareto-optimal flows that jointly optimize task accuracy and cost. A novel early-stopping mechanism further improves efficiency by pruning clearly suboptimal candidates. Across multiple benchmarks, syftr finds flows which are on average ≈9× cheaper while preserving most of the accuracy of the most accurate flows on the Pareto frontier. Furthermore, syftr’s ability to design and optimize also allows easy integration of new modules, making it even easier and faster to realize high-performing generative AI pipelines.
Building syftr is especially challenging from an infrastructure point of view. For example, one of the most compute-intensive parts of the search process is Vector Database (VDB) construction. As syftr tries out multiple embedding models, chunk sizes, and other settings, VDB construction for large datasets forms a large part of the search compute. Small embedding models can run on CPUs (cheap and plentiful) while larger ones require GPUs (expensive and scarce). syftr utilizes Ray to distribute this workload across heterogeneous compute clusters of CPUs and different GPU SKUs (T4, A100, H100, etc.). When we self-host OSS models in the search space, syftr creates inference load hotspots as the optimizer homes in on a few LLMs and embedding models that are part of flows on the Pareto frontier. Ray Serve provides a way to autoscale high-demand models while scaling cold models to zero.
In this talk we go deep into how Ray’s combination of scale, robustness, and ease of use accelerates research like syftr at the intersection of AI and AI infrastructure.
Paper: https://arxiv.org/abs/2505.20266 (AutoML 2025)
Code: https://github.com/datarobot/syftr
Blog: https://www.datarobot.com/blog/pareto-optimized-ai-workflows-syftr
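For readers unfamiliar with the scale-to-zero pattern mentioned above, a rough Ray Serve sketch follows (not syftr's actual code); the deployment name, replica counts, and embedding logic are illustrative assumptions:

    from ray import serve

    @serve.deployment(
        ray_actor_options={"num_gpus": 1},
        autoscaling_config={"min_replicas": 0, "max_replicas": 4},  # cold models scale to zero
    )
    class EmbeddingModel:
        def __init__(self):
            self.model = None  # placeholder: load an OSS embedding model onto the GPU here

        async def __call__(self, request) -> dict:
            payload = await request.json()
            # Placeholder: return real embeddings from self.model in practice.
            return {"embedding": [0.0] * 768, "input": payload["text"]}

    app = EmbeddingModel.bind()
    # serve.run(app)  # autoscaled on the shared heterogeneous Ray cluster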
Reinforcement Learning
Physical AI
Image
Structured Data
Video
Ray at Applied Intuition: RL and Inference
Applied Intuition uses Ray to scale reinforcement learning and inference workloads that operate on petabytes of raw sensor data for autonomous driving.
For reinforcement learning, Ray’s distributed execution model and RLlib enable scalable open- and closed-loop training. We leverage Ray to run thousands of parallel environment rollouts, colocate GPU learners and simulators efficiently, and recover full state during training. Our hybrid setup builds on the new RLlib API to support both imitation and closed-loop regimes — allowing driving models to generalize to out-of-distribution scenarios.
For inference, Ray Data powers large-scale batch processing of sensor samples, providing a unified, high-throughput interface for streaming data from our lake and performing CPU-intensive transformations before GPU inference. The system scales seamlessly across Kubernetes, making it easy to transition from development to production.
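As a generic illustration of the new RLlib API stack mentioned above (not Applied Intuition's configuration, and method names vary slightly across Ray versions), a closed-loop training setup looks roughly like this; the environment and resource counts are placeholders:

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")                         # placeholder; a driving simulator in practice
        .env_runners(num_env_runners=32)                    # parallel environment rollouts
        .learners(num_learners=2, num_gpus_per_learner=1)   # colocated GPU learners
        .training(train_batch_size=32768)
    )

    algo = config.build()
    result = algo.train()  # one training iteration; returns a metrics dict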
Reinforcement Learning
Physical AI
Research
Structured Data
End‑to‑End Hybrid Reinforcement and Imitation Learning for Robotics with Ray
The nature of machine learning in robotics demands complex abstractions over hardware and training/simulation layers to combine RL and IL (imitation learning). In this respect, policy learning for robotics rarely fits on one kind of machine. For instance, massive simulation parallelization with GPU physics and rendering in Isaac Lab demands RTX‑class GPUs, while policy training benefits from large VRAM and FLOPs. Over the past year we have built our infrastructure on Ray to hide this hardware/software diversity and let researchers focus on science, not sys‑admin work.
Our platform offers:
- Unified orchestration – a single Ray workflow trains full-state RL models, uses them to train a multi-task IL policy, and evaluates the result in simulation.
- Heterogeneous GPU scheduling – placement groups assign Isaac Lab simulators to RTX workers and gradient computation to A100/H100 trainers without manual mapping (see the sketch after this abstract).
- Isolated deployment targets – the same job definition that trains a policy can package it into a lightweight Ray Serve micro‑service that runs next to the robot or on a nearby edge server, shielding control code from research churn.
During the live demo we will:
- Launch a hybrid RL‑IL run that automatically provisions both Nvidia-RTX GPUs and A100/H100 nodes.
- Watch Ray adapt the cluster as workloads shift from simulation to learning to evaluation.
- Deploy the resulting policy to an isolated runtime on the robot—ready for immediate testing.
Attendees will leave with practical design patterns for juggling simulator‑heavy and large‑scale network training inside one reproducible Ray ecosystem, plus insights on meeting real‑time robotics constraints while remaining GPU‑efficient.
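A minimal sketch of the heterogeneous GPU scheduling pattern described above, assuming a custom "rtx_gpu" node resource is set on RTX machines at cluster creation and that Ray's auto-detected accelerator_type:A100 resource is available on trainer nodes; bundle sizes and function bodies are placeholders:

    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    ray.init()

    # Bundle 0: RTX simulation worker; bundle 1: A100/H100 learner.
    pg = placement_group([
        {"CPU": 8, "GPU": 1, "rtx_gpu": 1},                    # custom resource on RTX nodes (assumption)
        {"CPU": 16, "GPU": 1, "accelerator_type:A100": 0.01},  # auto-detected on A100 nodes
    ])
    ray.get(pg.ready())

    @ray.remote(num_gpus=1)
    def run_simulation():
        ...  # Isaac Lab rollouts on the RTX bundle

    @ray.remote(num_gpus=1)
    def train_policy():
        ...  # gradient updates on the A100/H100 bundle

    sim = run_simulation.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=0)
    ).remote()
    learner = train_policy.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=1)
    ).remote()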
Machine Learning
Media & Gaming
Structured Data
A foundation model for each enterprise
B2C enterprises possess vast and diverse datasets and often gain value from predicting user behavior. Yet their predictive modeling capabilities are fragmented into siloed, single-task systems. This approach creates redundant feature engineering, incurs excessive training costs, and lacks the flexibility to adapt to new predictive tasks.
We argue, based on our experience at Dream11, that the paradigm of large-scale, pre-trained foundation models can be extended beyond the domain of language to create a single, cohesive user-intelligence engine for any enterprise. This talk introduces Lumos, a framework for building enterprise-specific foundation models, and details how Ray played a crucial role in overcoming the significant scaling challenges involved.
The core of Lumos is a multi-task, multi-timestep Transformer architecture designed to forecast a wide array of user behaviors (e.g., churn, engagement, lifetime value) by ingesting historical user transactions, user attributes, and the calendar of future business events (supply). Our architecture introduces a strong inductive bias through a cross-attention mechanism, where a decoder conditioned on future supply events attends to an encoder that has processed the full history of user behavior. We will also share our work on formulating the equivalents of scaling laws and emergent abilities in the realm of enterprise transaction data. Ray Data helped us efficiently load data from blob storage onto GPUs, while Ray Train helped us distribute training across multi-node, multi-GPU clusters without needing any code changes.
Using these modules, we were able to efficiently train models using over 50 terabytes of user data across multiple nodes and GPUs, achieving high accuracy and notable improvements in online metrics.
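A generic sketch of this Ray Data + Ray Train pairing (not Dream11's pipeline); the storage path, model, and worker count are placeholder assumptions:

    import ray
    import torch
    from ray import train
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_loop_per_worker(config):
        # Each worker streams its shard of the dataset straight onto its GPU.
        shard = train.get_dataset_shard("train")
        model = train.torch.prepare_model(torch.nn.Linear(128, 1))  # placeholder model
        for _ in range(config["epochs"]):
            for batch in shard.iter_torch_batches(batch_size=1024):
                ...  # forward/backward pass over the streamed batch

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"epochs": 1},
        scaling_config=ScalingConfig(num_workers=16, use_gpu=True),  # multi-node, multi-GPU
        datasets={"train": ray.data.read_parquet("s3://bucket/user-events/")},  # placeholder path
    )
    result = trainer.fit()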
Ray Deep Dives
Optimizing GPU Utilization: Ray Train’s Distributed Solutions for Removing Training Bottlenecks
TBD
Ray Deep Dives
Ray Serve: Advancing scalability and flexibility
Ray Serve has become one of the most popular libraries for modern AI applications. Unlike most online inference frameworks, Ray Serve is natively built to support multiple models, any hardware, and any inference framework. This session will highlight significant advancements in Ray Serve over the last year, including: 1) flexibility to capture more complex inference patterns, 2) performance at scale, and 3) multi-cloud inference.
Lunch + Networking
Grab lunch and explore Rayground, where you'll find Anyscale demos, sponsor booths, and the Lightning Theater.
vLLM
Text / Docs
State of vLLM 2025
In this talk, we will review the past year of the vLLM project and discuss the road ahead.
Lightning Talk
Text / Docs
Horizontal, Predictable, High-Throughput Inference for Synthetic Data Generation, Evals, and More
Sutro (https://sutro.sh/) is an accelerated batch inference service. We use vLLM under the hood to power offline inference workloads ranging from a few hundred to tens of billions of tokens, often for synthetic data generation, evals, or processing unstructured data. It's critical for us to be able to use vLLM in a predictable way - from a cost, performance, and transparency standpoint. In this talk we'll explain how we use vLLM under the hood, from our custom implementation, to our performance profiler, throughput estimation algorithms, and cost attribution instrumentation. This talk is geared towards teams looking to push the boundaries of what's possible with vLLM at scale.
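For context, vLLM's offline batch entry point for this kind of workload looks roughly as follows; the model and prompts are placeholders, and Sutro's production wrapper adds the profiling, throughput-estimation, and cost-attribution layers described in the talk:

    from vllm import LLM, SamplingParams

    # Offline (batch) inference: vLLM continuously batches prompts internally
    # to maximize throughput on the available GPUs.
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
    sampling = SamplingParams(temperature=0.7, max_tokens=256)

    prompts = [f"Write a short product description for item {i}." for i in range(1000)]
    outputs = llm.generate(prompts, sampling)

    for out in outputs[:3]:
        print(out.outputs[0].text)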
Lightning Talk
Structured Data
Improved Scheduling Flexibility with Label Selectors in Ray
Acquiring scarce accelerator resources for Ray applications on a heterogeneous cluster can be challenging due to different accelerator type and topology requirements and limited availability. These issues previously required workarounds such as setting custom resources and accelerator_type.
Ray's new Label Selector API helps alleviate these challenges by enabling users to schedule tasks, actors, and placement groups using Ray node labels that are either specified by the user at RayCluster creation or detected automatically by Ray. The API supports both static and autoscaling RayClusters, fallback strategies, and per-bundle selectors, enabling users to make precise placement decisions at the application level. This functionality is incorporated in the Anyscale platform, the Ray dashboard, and KubeRay, and the same user code operates identically across platforms.
This talk will primarily explore common use cases, API modifications, and a live demo highlighting how the new label selector API enhances scheduling flexibility.
Machine Learning
Structured Data
Exabyte-scale Streaming Iceberg IO with Ray, Flink, and DeltaCAT
A production case study highlighting how Amazon uses Ray and DeltaCAT at exabyte scale to resolve longstanding performance and scale challenges integrating streaming pipelines with Apache Iceberg. It also highlights how the Apache Flink, Ray, Apache Beam, and Apache Spark communities can start bringing the same benefits to their workloads using DeltaCAT's Iceberg table management jobs on Ray together with Flink and Beam.
Machine Learning
Image
Structured Data
Video
Scalable High-Performance Multi-Modal Data Curation on Ray
Processing petabyte-scale, multi-modal data for Generative AI, spanning text, video, audio, and more, is a complex distributed systems challenge. These pipelines require a framework capable of handling heterogeneous workloads, stateful operations like deduplication, and GPU acceleration. This session explores architectural patterns for building such pipelines using Ray.
Drawing on our experience building NVIDIA NeMo Curator, we demonstrate how Ray’s primitives enable efficient, scalable data processing. We will cover how to leverage Ray Actors for stateful, long-running tasks and Ray Tasks for stateless parallel transformations, managing heterogeneous CPU/GPU resources to maximize throughput and pipeline robustness.
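A toy illustration of the actor/task split described above (not NeMo Curator code): a stateful actor carries deduplication state across batches, while a stateless task does per-batch transforms; the hashing is a stand-in for real dedup logic:

    import ray

    ray.init()

    @ray.remote
    class Deduplicator:
        """Stateful, long-running actor: remembers document hashes across all batches."""
        def __init__(self):
            self.seen = set()

        def filter_new(self, docs):
            fresh = [d for d in docs if hash(d) not in self.seen]
            self.seen.update(hash(d) for d in fresh)
            return fresh

    @ray.remote
    def clean(docs):
        """Stateless task: a purely per-batch transformation, safe to run anywhere."""
        return [d.strip().lower() for d in docs]

    dedup = Deduplicator.remote()
    batches = [["A doc", "a doc ", "Another doc"], ["Another doc", "New doc"]]
    cleaned = [clean.remote(b) for b in batches]            # parallel stateless transforms
    unique = [dedup.filter_new.remote(c) for c in cleaned]  # funneled through the stateful actor
    print(ray.get(unique))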
LLMs
Media & Gaming
Text / Docs
Scaling LLM Post-Training at Character.AI
Character.AI is the world's leading application for AI entertainment, serving tens of millions of users per day with large language models (LLMs). To continuously improve the models that power our AI Characters, we have built a robust and scalable post-training stack entirely on open-source technologies in the Ray ecosystem. Our fine-tuning stack, internally named Rayman, has allowed us to accelerate our model development velocity and large MoE model training efficiency. We are also able to utilize and adapt open-source RL libraries (Verl) to deal with our unique challenges in RL training. In this talk, we will detail the architecture of Rayman, the open-source projects we leverage, our RL framework, and the ML challenges we've overcome.
Specifically, we will cover:
1. Infrastructure for Fine-tuning and Distillation: We will introduce Rayman, our internal framework built on Ray Data, Ray Train, and DeepSpeed/PyTorch FSDP/internal pipeline SFT for orchestrating all distributed workloads. We'll detail how we use this for large-scale SFT and DPO, including our strategy for training massive Mixture-of-Experts (MoE) models like those from DeepSeek. We will also go over our approach for knowledge distillation of state-of-the-art open-source LLMs into smaller, more tractable models.
2. Reinforcement Learning from Real User Feedback: A core challenge in aligning models for open-ended creative dialogue is that there are no verifiable rewards. We will discuss how we tackle this problem by training our own reward models on real user interaction data which we then use for RL. We'll detail our RL framework built on top of Verl, which allows us to translate noisy, real-world user feedback into a clear signal that can be effectively "hill-climbed" using a variety of reinforcement learning techniques to significantly improve the quality of our models.
Breakfast + Networking
Enjoy a light breakfast and coffee while mingling with Ray attendees.
Lightning Talk
Text / Docs
Structured Data
From S3 Bottlenecks to Scalable I/O: Evolving Ray Pipelines with Alluxio
Fast model iteration is essential for AI startups, especially those building advanced document understanding and search capabilities. In this talk, we walk through the evolution of a Ray-based training pipeline from a real-world use case, designed to accelerate experimentation and reduce deployment latency for new search models. We compare three architectural designs: direct S3 access from Ray, Alluxio as an async write buffer, and Alluxio with fully decentralized cache-through writes.
In the initial setup, direct I/O between Ray for preprocessing and PyTorch for training struggled under high concurrency—over 1,000 Ray workers writing during preprocessing—leading to poor throughput and high S3 egress costs. A second iteration introduced Alluxio with async writes, which improved training-stage read performance, but metadata saturation and write path instability led to frequent cluster crashes and job restarts.
Migrating to a decentralized Alluxio 3.x setup with cache-through writes enabled stable ingestion under bursty 400 Gbps internal bandwidth, while throttling 10 Gbps outbound writes to S3. The result: no restarts, better GPU utilization, and fast, scalable iteration. We’ll share practical lessons on turning a fragile I/O stack into a production-grade AI data infrastructure.
Ray Summit Celebration Happy Hour
Join colleagues and friends in the Rayground for demos, sponsors and Ray conversations.
Matrix: reliable framework for data-centric experimentation at scale
Scaled, high-quality data is the oil driving progress toward AGI in research and development. Thanks to foundational works such as Ray, Slurm, and vLLM, it has become much easier to manage compute resources at scale and to access a diverse set of SOTA LLMs. However, these efforts are often designed for experienced engineers, with entry barriers that keep researchers from unleashing their full potential. Thus, in the Fundamental AI Research (FAIR) lab at Meta, we have built Matrix, a reliable framework for data-centric experimentation at scale, to connect these foundational pieces so researchers can quickly iterate on their ideas and build experiments with large-scale models and data.
Matrix supports robust and auto-scaled data generation from LLMs, game engines, and physics or world model simulators, with one command. It also offers easy setup for scalable data processing and augmentation, such as LLM-as-a-judge in batch, safe code execution for verification, and data dedup, classification, or clustering. The framework also offers efficient and reproducible evaluation pipelines for large teams to collaborate on.
Matrix is widely used to empower Meta’s research and production bets in AGI including MLLMs and world modeling. In this session, we will introduce the Matrix framework from its design and synergy with other industry initiatives like Ray and vLLM, to the research and production use cases Matrix enables. We will also provide a short tutorial for developers to join the area.
Machine Learning
Structured Data
Breaking the Dataset Iteration Bottleneck: Real-Time ML Experimentation with Ray
At Pinterest, iterating on dataset curation and label generation consistently improves our recommendation models, but this process is severely constrained by expensive and time-consuming data generation workflows. When experimenting with new sampling strategies, features, or labels, teams face a critical choice: either backfill long-running jobs that strain compute resources and budget, or wait weeks for experimental datasets to naturally populate with new data. This creates a fundamental barrier to data-driven model improvement, where a single dataset iteration either costs thousands of dollars and requires tedious monitoring of the backfill process, or takes weeks of waiting. In either case, developer velocity is severely impacted.
Two pivotal use-cases within Pinterest exemplify these challenges, namely the Dataset Sampling Strategy Exploration and the Generation of Labels for Downstream Engagement Modeling. Sampling is fundamental for creating training datasets from massive data repositories. Our sampling strategy determines the composition and quality of resulting datasets, thus the resulting model, yet iterating on these strategies is prohibitively difficult and expensive. The current data generation workflows also prevent adoption of sophisticated techniques like score-based negative sampling that requires real-time computation during training. Downstream engagement labels present a similarly complex challenge. Unlike immediate action labels, these labels focus on driving long-term user engagement rather than instant responses. The complexity increases because each label involves multiple tunable hyperparameters (e.g. engagement decay) creating a vast search space. In both cases, teams would ideally conduct hyperparameter tuning to systematically explore these vast search spaces and identify optimal configurations, but the current data pipeline architecture makes such comprehensive exploration prohibitively expensive and time-consuming.
To address these limitations, we shifted both use-cases from static dataset generation to a streaming paradigm using Ray that enables truly iterative experimentation, moving sampling and label generation logic directly into the training dataloader to process data in real-time. This eliminates the costly choice between expensive backfills and weeks of waiting, while enabling comprehensive hyperparameter exploration. The impact spans both domains: sampling changes on ranking models now ship with 10x faster development time, while downstream engagement label experimentation has been reduced from 6 weeks to 3 days and adopted by multiple teams. The solution's power is fully realized during productization, where teams must simultaneously optimize both label generation parameters and sampling strategies - our unified approach handles both seamlessly within the same pipeline. Combined with Ray's bucket join capabilities that enable joining large embedding features and multiday datasets previously impossible due to cost and compute constraints, this has saved hundreds of thousands of dollars in costs while transforming dataset iteration from a fundamental bottleneck into an enabler of rapid experimentation.
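As a rough sketch of this streaming paradigm (not Pinterest's implementation), sampling and label generation can run inside a Ray Data pipeline that feeds the trainer directly; the dataset path, column names, decay parameter, and sampling rule are illustrative assumptions:

    import random
    import ray

    def generate_labels(batch, decay: float = 0.9):
        # Hypothetical downstream-engagement label computed on the fly,
        # so tuning `decay` requires no dataset backfill.
        batch["label"] = [decay ** d for d in batch["days_to_engagement"]]
        return batch

    def negative_sample(batch, keep_prob: float = 0.1):
        # Hypothetical sampling rule: keep all positives, downsample the rest.
        keep = [l > 0.5 or random.random() < keep_prob for l in batch["label"]]
        return {k: [v for v, m in zip(vals, keep) if m] for k, vals in batch.items()}

    ds = (
        ray.data.read_parquet("s3://bucket/engagement-events/")  # placeholder path
        .map_batches(generate_labels)
        .map_batches(negative_sample)
    )

    for batch in ds.iter_torch_batches(batch_size=4096):
        ...  # feed the trainer directly; no intermediate dataset is materialized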
Machine Learning
Text / Docs
Image
Structured Data
Super App, Super Powers: How a Small Team Built Grab's Multi-Modal Foundation Model with Ray
At Grab, Southeast Asia's leading super app, a single user journey is a rich, multi-modal story. To understand our users holistically, we needed to learn from the complex web of their interactions across our diverse services. Our goal was to build a powerful user embedding foundation model that could capture this complete view, enhancing numerous downstream models and personalizing the user experience.
The core output of this model is a set of powerful, general-purpose user embeddings. These numerical representations act as a universal feature set, designed to fuel a wide array of downstream applications and eliminate the need for siloed, hand-engineered features for each task.
However, off-the-shelf models could not comprehend our unique data ecosystem—a complex blend of long-term tabular profiles and short-term sequential interactions. This forced us to develop a custom transformer architecture that unifies these diverse data types using a novel key-value tokenization strategy and modality-specific adapters.
Training this model on terabytes of data and generating millions of embeddings daily presented a significant scaling challenge, especially for a small team. We overcame this by leveraging Ray to build an efficient, distributed computing pipeline.
Today, our pre-trained embeddings are a critical component for systems across the company. They are actively powering a wide range of applications, including Churn Prediction, Ads Optimization, and Dual App Detection, creating millions of fresh embeddings daily for our users, merchants, and drivers.
Machine Learning
Media & Gaming
Text / Docs
Image
JIT-Embedding with Ray Serve: Accelerating Large-Scale GenAI Foundation Model Training in Adobe Firefly
This presentation introduces JIT-Embedding (Just-in-Time Embedding), a novel solution designed to accelerate the training of foundational Generative AI (GenAI) models, with a focus on image and video generation in Adobe Firefly. By decoupling the expensive embedding computation from model training, JIT-Embedding enables these processes to scale independently. Built on Ray Serve, our architecture includes a robust JIT Service and JIT Client, seamlessly integrated with our Model Hub and Dataloader. The experimental results demonstrate that this approach significantly improved scalability, enabled higher-resolution and larger-scale GenAI Foundation Model training, and achieved notable performance gains and cost reductions. It is one of the innovations contributing to the Firefly Video Model public release.
JIT-Embedding addresses several key challenges in large-scale foundation diffusion model training:
1. Slow on-the-fly embedding computation during training (e.g., VAE, CLIP, and T5 embeddings).
2. Long turnaround time required for offline embedding pre-computation.
3. High cost associated with recomputing embeddings using either approach.
4. Severe GPU memory constraints when training large models or processing high-resolution images/videos.
Our solution introduces several innovations to mitigate these issues:
1. JIT Service via Ray Serve: Wraps embedding computation as an on-demand service, deployable on underutilized lower-tier GPUs (e.g. A100), freeing up high-end GPUs (H100) for model training and optimizing resource allocation. GPU memory requirements drop significantly on both sides.
2. JIT Client with Dataloader Integration: Uses multiprocessing and prefetching to overlap embedding requests with training, effectively hiding latency of on-the-fly embedding computation and maximizing GPU utilization.
3. Efficient Serialization/Deserialization: We created a Rust + Python library, inspired by functional programming, that efficiently compresses multimodal data (e.g., images, videos, long text) to improve server–client communication throughput and flexibility.
4. Advanced Performance Optimization: Combines Ray Serve’s dashboards with our custom metrics, profiling, and load testing tools. We leverage advanced Ray features such as autoscaling, dynamic batching, and in-place model updates. Key optimizations include client-side load balancing, faster video/image codecs in Rust, overlapping CPU/GPU ops, and shared GPU usage across multiple models.
5. JIT Cache: Automatically stores computed embeddings for reuse across future training jobs, further reducing cost and computation time.
We plan to open source the JIT-Embedding solution, including the services, clients, and the serialization/deserialization library.
This talk will provide a comprehensive overview of the JIT-Embedding architecture, including the design of the JIT Service, JIT Client, Serialization/Deserialization, and the caching mechanism. We will present end-to-end experimental results from large-scale model training, showcasing the system’s scalability, performance enhancements, and cost efficiency. The session will conclude with key takeaways from our journey with Ray Serve and future directions for continued optimization.
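A minimal sketch of the JIT Service idea under stated assumptions (the encoders and request payload are placeholders; this is not Adobe's code): embedding computation sits behind a Ray Serve deployment on lower-tier GPUs, while the training-side client prefetches requests to hide latency.

```python
from ray import serve


@serve.deployment(
    num_replicas=4,
    ray_actor_options={"num_gpus": 1},  # e.g. schedule on A100s, keeping H100s for training
)
class JITEmbedder:
    def __init__(self):
        # Stand-ins for the real VAE / CLIP / T5 encoders, loaded once per replica.
        self.encode_image = lambda pixels: [sum(pixels) / max(len(pixels), 1)]
        self.encode_text = lambda text: [float(len(text))]

    async def __call__(self, request):
        payload = await request.json()  # e.g. {"pixels": [...], "caption": "..."}
        return {
            "latent": self.encode_image(payload["pixels"]),
            "text_emb": self.encode_text(payload["caption"]),
        }


app = JITEmbedder.bind()
# serve.run(app)  # the training dataloader's client prefetches requests to hide latency
```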
Lightning Talk
Text / Docs
Powering the Future of LLMs: AWS and the vLLM open source project
Amazon is a strong supporter of and contributor to vLLM, the leading open source inference engine for serving LLMs. vLLM is used across Amazon and enables millions of customers to use the Amazon Rufus shopping assistant. vLLM's support for heterogeneous hardware, including AWS Trainium and NVIDIA GPUs, has enabled deployment of a cost-optimized, multi-node inference architecture. This hybrid approach allows us to route requests to the most appropriate accelerator, leading to infrastructure cost savings without compromising performance. In this session, we'll dive into AWS deployment options with vLLM, our existing open source work streams, and other initiatives.
vLLM
Text / Docs
Scaling LLM Inference with RayServe & vLLM: Building a Serverless Internal, Enterprise Model Hosting Platform
In this talk, we'll share how we built an internal, enterprise, serverless model hosting platform using RayServe and vLLM—powering fast, scalable LLM inference across teams. Drawing inspiration from best-in-class industry solutions, our platform empowers users to deploy and manage models through a streamlined, self-service interface. We’ll dive into the key capabilities we’ve layered on top, including challenges and solutions around multi-tenancy, auto-scaling, token-level budgeting, request observability, and fine-grained resource controls. Whether you're building for internal developers or external customers, this session will show how RayServe and vLLM can be combined to deliver reliable, production-grade model inference at scale.
Ray Deep Dives
Ray Direct Transport: Direct GPU-GPU communication in Ray Core
GPU workloads on Ray often hit a hidden bottleneck: every tensor passed between tasks takes a costly trip through CPU memory and serialization. Ray Direct Transport (RDT) is a new feature in Ray Core that eliminates this overhead by keeping GPU data on the device and transferring it directly between actors—no unnecessary copies, no serialization.
Powered by high-performance backends like NCCL, Gloo, and RDMA, RDT enables easy and efficient scaling of cutting-edge workloads like reinforcement learning for LLMs and disaggregated multimodal training. In this talk, we’ll show how RDT integrates seamlessly with the familiar Ray ObjectRef API, the architecture behind RDT, and demonstrate how it unlocks fast and flexible distributed GPU programming.
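For orientation, a minimal sketch of the familiar ObjectRef pattern that RDT builds on, using plain Ray actors (RDT's own configuration is not reproduced here); with RDT, a handoff like this can stay on-device instead of bouncing through CPU memory and serialization:

```python
import ray
import torch


@ray.remote(num_gpus=1)
class Producer:
    def make_tensor(self):
        return torch.randn(1024, 1024, device="cuda")


@ray.remote(num_gpus=1)
class Consumer:
    def consume(self, tensor):
        # With RDT, a transfer like this can go GPU-to-GPU via NCCL/RDMA.
        return float(tensor.sum())


producer = Producer.remote()
consumer = Consumer.remote()

ref = producer.make_tensor.remote()           # ObjectRef to a GPU tensor
print(ray.get(consumer.consume.remote(ref)))  # pass the ObjectRef, not the data
```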
Lightning Talk
Text / Docs
Structured Data
Synthetic data generation with ray data + serve + vLLM
This talk covers design patterns and considerations for combining ray data + serve + vLLM to construct scalable, high throughput pipelines for synthetic data generation. As an illustrative example, we implement a two-agent self-refinement loop using ray.serve + vLLM and integrate it into a ray.data pipeline.
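A minimal sketch of the generate-then-refine loop under stated assumptions (model name, prompts, and resource counts are placeholders, and vLLM is wrapped directly in the data pipeline here rather than behind ray.serve as in the talk):

```python
import ray
from vllm import LLM, SamplingParams


class Generator:
    def __init__(self):
        self.llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # assumed model choice
        self.params = SamplingParams(max_tokens=512)

    def __call__(self, batch):
        outputs = self.llm.generate(batch["prompt"].tolist(), self.params)
        batch["draft"] = [o.outputs[0].text for o in outputs]
        return batch


class Refiner(Generator):
    def __call__(self, batch):
        critiques = [f"Critique and improve this answer:\n{d}" for d in batch["draft"]]
        outputs = self.llm.generate(critiques, self.params)
        batch["final"] = [o.outputs[0].text for o in outputs]
        return batch


ds = ray.data.from_items([{"prompt": f"Explain topic {i}"} for i in range(1_000)])
ds = ds.map_batches(Generator, batch_format="pandas", batch_size=64, concurrency=2, num_gpus=1)
ds = ds.map_batches(Refiner, batch_format="pandas", batch_size=64, concurrency=2, num_gpus=1)
ds.write_json("/tmp/synthetic_data")
```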
Machine Learning
Media & Gaming
Structured Data
Mako: Netflix's Next Generation ML Training Platform
At Netflix, we are building Mako, a new ML training platform designed to meet the demands of modern AI workloads. In this talk, we will share how we evolved our training platform, improved GPU efficiency using a custom scheduler, and made key architecture changes to support large-scale training. We will also cover how Ray fits into this journey and what we learned along the way.
Machine Learning
Physical AI
Video
Optimizing Video AI at Scale: Cost-Effective ML Operations with Geotab and Anyscale Ray
Processing and deriving intelligence from billions of frames of video data captured by Geotab cameras can be a resource-intensive task. This presentation will share Geotab's journey of building a cost-efficient and highly automated Smart Video Platform utilizing Anyscale Ray.
We will showcase how Ray serves as the backbone for hosting and orchestrating our machine learning models for video analysis, enabling both efficient real-time inference and batch processing.
A key focus will be on our automated training and validation workflows, which leverage Ray's distributed capabilities to dramatically reduce the time and cost associated with model development and deployment. Learn how Geotab is achieving significant operational savings and accelerating innovation in video analytics through a strategic embrace of Anyscale Ray.
Lightning Talk
Structured Data
How Ray Is Powering The Open-source Data Ecosystem
Need to process 99% of US precedential caselaw for under $1? Scale distributed video processing to hundreds of terabytes of Creative Commons videos? Seamlessly churn through petabytes of text and images? Learn how Ray and Anyscale form the foundation of the open-source data engineering ecosystem.
Lightning Talk
Text / Docs
vLLM with the Transformers backend: One model definition to rule them all
What if you could use the same model implementation for training and inference?
With the Transformers backend for vLLM, that’s now a reality. vLLM can run any Transformers-compatible model (merged into Transformers or custom!) directly from its original definition at vLLM speeds! Your modeling code becomes the single source of truth, while vLLM handles high-performance inference, attention optimizations, and scaling.
In this session we’ll take a whirlwind tour of: what the Transformers backend is, how to make a model compatible, and how this enables teams to use the same modeling codebase for research and production.
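A minimal sketch of what this looks like in practice; the model repo is hypothetical, and the exact flag used to select the Transformers backend should be checked against your vLLM version:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="my-org/my-custom-model",   # hypothetical Transformers-compatible repo
    model_impl="transformers",        # select the Transformers backend (flag name is an assumption)
    trust_remote_code=True,           # required if the repo ships custom modeling code
)
out = llm.generate(["Hello from the Transformers backend!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```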
Machine Learning
Finance
Structured Data
Building a Model Fitting Framework for Quant Finance with Ray & Anyscale
Quant trading and research teams at Point72/Cubist have diverse needs related to data, models, and their specific use cases. Investing in an on-premise Ray cluster enabled Ray-focused approaches, but adoption has not always been seamless. Challenges emerged around data management (loading, reuse, access), scaling (efficiently performing parallel windowed model training, sometimes on tens of terabytes of timeseries data), and platform usage (determining how and when to utilize an Anyscale cluster).
The Cubist Core Research Technologies team has developed libraries to tackle these challenges. We'll discuss patterns we found to be successful, those that fell short, and share insights that we believe can be widely applied.
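As a rough illustration of parallel windowed fitting with plain Ray tasks (not Cubist's library; the data and the fitted "model" are placeholders):

```python
import numpy as np
import ray


@ray.remote
def fit_window(series: np.ndarray, start: int, length: int) -> dict:
    window = series[start : start + length]
    # Placeholder "model": fit a first-order autoregression by least squares.
    slope = float(np.polyfit(window[:-1], window[1:], 1)[0])
    return {"start": start, "ar1_coef": slope}


series = np.cumsum(np.random.randn(100_000))  # stand-in for a price series
series_ref = ray.put(series)                  # share the data instead of re-shipping it per task

window, step = 5_000, 1_000
futures = [
    fit_window.remote(series_ref, start, window)
    for start in range(0, len(series) - window, step)
]
results = ray.get(futures)
print(results[:3])
```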
Machine Learning
Image
Video
Scaling Image and Video Processing with Ray
Engineering Lessons from Scaling Synthetic Data for Trillion-Scale Pretraining with KubeRay + vLLM
As language models continue to grow in capability and scale, high-quality training data has become a critical bottleneck. Synthetic data generation has emerged as a core technique for creating diverse, targeted datasets that complement organic sources and power models from lightning-fast 4.5B-parameter systems to frontier models like GPT-5.
We will share engineering lessons from building and scaling a production platform that processes trillions of tokens using KubeRay and vLLM, dynamically orchestrating thousands of GPU workers across multimodal recaptioning, rephrasing, and domain-specific content generation tasks. Topics include pushing vLLM inference to near-peak GPU utilization, designing fault-tolerant Ray actors for tensor-parallel sharding, auto-scaling KubeRay clusters to match workload patterns, and applying storage and scheduling strategies that deliver both high performance and significant cost efficiency.
We will highlight practical patterns for building resilient, scalable ML infrastructure. Attendees will learn how clean abstractions between Ray's distributed computing layer and vLLM's inference engine enable rapid iteration on prompt engineering while maintaining production stability, accelerating the journey from research prototypes to trillion-token datasets that define the next generation of AI capabilities.
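A minimal sketch of the fault-tolerant actor pattern under stated assumptions (model name, GPU counts, and prompts are placeholders; the production system runs these actors on KubeRay-managed clusters):

```python
import ray
from vllm import LLM, SamplingParams


@ray.remote(num_gpus=4, max_restarts=3, max_task_retries=3)
class VLLMWorker:
    def __init__(self):
        # Tensor-parallel shard the (assumed) model across this actor's 4 GPUs.
        self.llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)
        self.params = SamplingParams(max_tokens=256)

    def generate(self, prompts):
        return [o.outputs[0].text for o in self.llm.generate(prompts, self.params)]


workers = [VLLMWorker.remote() for _ in range(8)]           # e.g. 8 actors x 4 GPUs each
prompts = [f"Rephrase document {i} for pretraining." for i in range(256)]
shards = [prompts[i::len(workers)] for i in range(len(workers))]
results = ray.get([w.generate.remote(s) for w, s in zip(workers, shards)])
```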
Machine Learning
Structured Data
Building RayLab: Autodesk’s Journey to Scalable Deep Learning Infrastructure
In this presentation, we describe Autodesk's journey to enabling large-scale deep learning across the company. We began by exploring managed solutions like AWS Batch and SageMaker, but quickly ran into challenges around scalability, customization, networking, and developer experience. To overcome these limitations, we turned to Ray and KubeRay, which offered the flexibility and control we needed. Building on top of these technologies, we developed RayLab - Autodesk's internal platform for scalable training, data processing, and model serving.
RayLab is a Kubernetes-native platform that abstracts away infrastructure complexity while supporting secure, efficient, and user-friendly workflows. We built wrappers for Ray cluster management via a CLI, Web UI, and Python SDK to simplify usage, reduce onboarding friction, and ensure compliance with Autodesk's internal security and networking requirements. We'll describe the architecture behind RayLab, which includes Kubernetes, KubeRay, Karpenter, Grafana, and JupyterHub - all secured with role-based access control and designed for multi-tenancy to support ML workspaces across teams.
RayLab provides high-level APIs built on Ray and PyTorch Lightning, allowing users to launch distributed training jobs with minimal code. It includes standardized checkpointing and experiment tracking to support reproducibility and consistent workflows.
We'll also share the challenges we faced in improving the efficiency of our reserved H100 GPU resources, particularly around fair sharing across teams and users. To address this, we implemented quota management and a priority-based scheduling system that enables high-priority jobs to preempt lower-priority ones, significantly increasing utilization. Additionally, RayLab supports automatic downscaling of underutilized clusters to conserve compute.
Finally, we'll conclude with a live demo of RayLab in action.
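For context, a minimal sketch of the kind of high-level Ray Train entry point a platform like RayLab can wrap (RayLab's own SDK is internal to Autodesk; the training loop body is a placeholder):

```python
import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Placeholder training loop; a real job would prepare the model and data
    # with ray.train.torch utilities and report metrics/checkpoints.
    model = torch.nn.Linear(128, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    for _ in range(10):
        loss = model(torch.randn(32, 128)).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
result = trainer.fit()
```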
vLLM
Text / Docs
FlashInfer: Accelerating LLM Inference Through Unified High-Performance Kernels
TBC
Ray Deep Dives
Advancing KubeRay: Deepening Ecosystem Integrations for Scalable AI Workloads
Improving user experience has been a central focus of the latest KubeRay developments. In this session, KubeRay maintainers will introduce major RayJob enhancements designed to make running and managing workloads simpler and more reliable, including deletion policies, cron scheduling, sidecar mode, and background status checks. These upgrades streamline the end-to-end job lifecycle for both developers and operators.
We will then explore new capabilities that elevate the KubeRay user experience across the Kubernetes ecosystem. Highlights include updates to the kubectl plugin for smoother user workflows, the redesigned APIServer V2 for a cleaner and more extensible control plane, and KubeRay Metrics to improve observability. We will also cover expanded support for third-party Kubernetes schedulers, along with progress on in-place pod resizing integration.
Finally, we will share the roadmap for the upcoming history server, which will enable debugging for dead Ray clusters. Join us to learn how these enhancements are transforming the KubeRay experience—making it more powerful, cohesive, and user-friendly for running AI workloads on Kubernetes at scale.
Ray Deep Dives
Structured Data Support for Ray Data
Ray Data is a data processing engine built to target AI workloads. In the past year, we've added support for traditional tabular data workloads to Ray Data. In this talk, we'll discuss key new features like joins, expressions, and data preprocessors in Ray Data; discuss new architectural changes needed to support these features; and showcase performance improvements that we see across Ray versions from these changes. We'll end with a discussion of the future roadmap and upcoming challenges.
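A minimal sketch of the tabular workflow described above; the paths and column names are hypothetical, and the join signature is an assumption that may differ across Ray versions:

```python
import ray
from ray.data.preprocessors import StandardScaler

users = ray.data.read_parquet("s3://bucket/users/")    # hypothetical path
events = ray.data.read_parquet("s3://bucket/events/")  # hypothetical path

# Assumed signature for the newly added Dataset join support.
joined = users.join(events, join_type="inner", on=("user_id",))

scaler = StandardScaler(columns=["total_spend"])  # built-in tabular preprocessor
train_ds = scaler.fit_transform(joined)
print(train_ds.schema())
```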
LLMs
Research
Text / Docs
Marin: Open Development of Open Foundation Models
Open-source software thrives because its entire lifecycle remains public: code, tests, even missteps. Foundation models rarely meet that bar: most “open-weight” releases omit the training code, data recipe, and logs needed for reproducibility.
Marin closes that gap. Every run begins as a GitHub pull request that defines the hypothesis and pins the config. Ray orchestrates the job across preemptible Google Cloud TPUs, streaming metrics and depositing artifacts tied exactly to the commit. Successes, failures, and restarts remain visible, not hidden.
With this workflow we trained Marin-8B, which outperforms Llama 3.1 8B Base on 14 of 19 benchmarks. We will share lessons from scaling to 32B parameters, training MoEs on preemptible hardware, and building a Ray-based RL pipeline for agentic models, focusing on autoscaling, fault tolerance, and dataset pipelines. We will also highlight ways to get involved, from optimizers and data curation to following training logs live.
LLMs
Finance
Text / Docs
Structured Data
Operationalizing Ray for Real-Time Inference at Scale
The KubeRay project simplifies the deployment of Ray clusters on Kubernetes. However, hardening these clusters to meet stringent production non-functional requirements (NFRs) typical of regulated environments such as those in finance requires additional engineering effort. These NFRs include multi-region and multi-AZ resilience, deployments with zero downtime, proactive autoscaling, automated production validation testing, and robust integration with monitoring and logging systems. Achieving this requires strict adherence to integration patterns and architectural discipline. This work outlines the engineering patterns and platform enhancements we applied to deploy a real-time recommendation system on a Ray cluster on EKS in production.
The system architecture routes traffic through Route53 to NLB, ALB, and finally nginx ingress to reach Ray Serve deployments running in isolated Kubernetes namespaces. Ray Serve’s proxy actors distributed across worker pods act as distributed ingress routers for balancing traffic across replicas, buffering during surges, and enabling concurrent processing without blocking. Inference workloads run on dedicated Ray clusters on separate Kubernetes namespaces managed by the centralized KubeRay Operator. Compute and network isolation is ensured by network policies and RBAC.
Autoscaling is handled at three levels: Ray Serve scales replicas based on request queue depth, KubeRay Autoscaler adjusts Ray worker pod counts based on cluster metrics, and the AWS Cluster Autoscaler provisions EC2 instances based on pending pods awaiting compute resources. This ensures responsiveness during traffic spikes while avoiding over-provisioning.
To maximize GPU utilization and reduce latency, the platform leverages Ray Serve’s dynamic batching in combination with vLLM to perform batched LLM inference. This approach ensures high-throughput, low-latency processing, especially under variable request loads, by grouping requests at runtime based on traffic characteristics.
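A minimal sketch of the two Ray Serve features this relies on, replica autoscaling and dynamic batching (the inference call is a placeholder, not the production code):

```python
from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 2,
        "max_replicas": 20,
        "target_ongoing_requests": 8,  # scale out as request queues deepen
    },
    ray_actor_options={"num_gpus": 1},
)
class Recommender:
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.01)
    async def handle_batch(self, payloads):
        # Placeholder for the batched vLLM call; returns one result per request.
        return [{"recommendation": f"item-for-{p['user_id']}"} for p in payloads]

    async def __call__(self, request):
        payload = await request.json()
        return await self.handle_batch(payload)


app = Recommender.bind()
```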
Observability is achieved through an integrated Prometheus and Grafana stack. A PodMonitor scrapes metrics from Ray components, which are then ingested by Prometheus and visualized in Grafana for real-time analysis and alerting. In parallel, a Fluentd DaemonSet captures logs from Ray and application pods, forwarding them to AWS CloudWatch. These logs are then ingested into Splunk for centralized search, monitoring, and audit compliance. The apps are also monitored at cluster level by Dynatrace and Datadog to further enhance observability and monitoring capabilities.
To enable robust and disruption-free deployments, the platform uses a blue-green deployment pipeline built with Spinnaker. The pipeline includes progressive rollout stages, automated validation, manual approval gates, and rollback paths.
This robust system demonstrates a scalable, resilient, and observability-driven approach to deploying real-time inference and LLM workloads on Ray. Attendees will gain valuable insights into end-to-end development and deployment architecture, GPU workload optimization, and operationalization of Ray Serve in production environments.
CoServe: Max performance, minimal compute
Cohere is committed to building a scalable and efficient all-in-one platform for private and secure AI solutions for enterprises. On top of the vLLM library, we combine accuracy-preserving low-bit quantization with extensive kernel and data communication optimization to deliver high-performance, low-latency inference services at minimal compute cost. For example, our foundation model Command A series can be served on a single H100 GPU at low latency while supporting more than 128K context length.
Machine Learning
Text / Docs
Structured Data
Revolutionizing Model Serving with a 50x Cost Reduction using Ray Serve at Workday
Workday uses a tenanted, regionalized architecture in order to ensure data isolation and in-region execution, both of which are crucial requirements for our customers. In early 2023, facing challenges with the ever-increasing scale and cost required to serve dedicated ML models for every tenant in every environment, we decided to completely redo how we serve models using a hot new technology: Ray! We now use Ray Serve to serve tens of thousands of ML models across more than a dozen environments. Ray Serve’s inherent capabilities of per-deployment autoscaling and efficient request routing have enabled 50x cost reductions compared to our previous systems while maintaining high availability and low latency.
But it wasn’t all smooth sailing: Ray Serve was not originally designed for some of the odd ways our system works, and we ran into scaling problems with only a few dozen applications deployed. Thankfully, Ray is open-source, and we were able to dramatically improve the scalability of Ray Serve through a series of contributions we made back to the community. Today, Ray Serve can handle thousands of applications per cluster.
In this talk, we’ll discuss how we use Ray Serve to implement model serving at Workday. We’ll dive deep into our off-the-beaten-path usage of Ray Serve, the challenges we faced with scaling up our system, and the improvements we’ve contributed back to Ray to solve them. You’ll come away from this talk with a deeper understanding of how Ray Serve works and how to build complex systems on top of it – and maybe even contribute back to it!
Machine Learning
Text / Docs
Image
Structured Data
Accelerating AI Pipelines with AnalyticDB Ray: Alibaba Cloud's Approach to Data-AI Convergence
In the era of data-driven innovation, efficiently processing and analyzing multi-modal data is crucial for building effective AI pipelines. This presentation will focus on real-world applications of Alibaba Cloud's AnalyticDB Ray, showcasing how its capabilities are leveraged within a data warehouse environment to accelerate AI initiatives.
We will delve into practical use cases that demonstrate the seamless integration of multi-modal ETL and machine learning:
1. Optimizing Advertising Recommendation Inference: Learn how AnalyticDB Ray is used for offline batch inference in advertising recommendations, specifically for estimating click-through rates (CTR). This includes details on how heterogeneous resources (CPU and GPU) are independently and automatically scaled to maximize GPU utilization, achieving an increase from less than 5% to 40%. We will also discuss the dynamic auto-scaling of object storage based on data volume, which has improved data processing performance by 2 to 3 times.
2. Accelerating Large Language Model (LLM) Offline Batch Inference and Data Distillation: Discover how AnalyticDB Ray facilitates large model data preparation. We will illustrate the use of Ray Data and vLLM/SGLang for data distillation with models like Qwen and DeepSeek, which then fuels large model training. Key benefits include a 2-3x improvement in data loading throughput due to caching, scheduling of 40,000 fine-grained tasks within a single Ray cluster, and a 50% performance increase for DeepSeek INT8 quantization compared to FP8 in offline distillation scenarios.
3. Efficient Distributed Fine-tuning of Multi-modal Models: Explore how AnalyticDB Ray, integrated with Lance, enhances distributed image-text data processing and structuring using Ray Data for multi-modal personalized interactive scenarios. We will also showcase the integration with LLaMA-Factory to provide distributed fine-tuning capabilities for Qwen-VL multi-modal models. This offers a one-stop solution from data labeling to model fine-tuning and has improved distributed fine-tuning efficiency by 3-5x.
These examples will illustrate how AnalyticDB Ray unlocks the potential for in-warehouse AI pipelines, seamlessly integrating multi-modal ETL and machine learning to accelerate the journey from data to intelligent decision-making.
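To make the Ray Data + vLLM distillation pattern in use case 2 concrete, here is a minimal open-source sketch. AnalyticDB Ray's own interfaces may differ; the model name, storage paths, and prompt/response column names are illustrative assumptions.

```python
# A generic open-source sketch of the Ray Data + vLLM offline batch inference
# pattern from use case 2; AnalyticDB Ray's own interfaces may differ.
# The model name, storage paths, and column names are illustrative assumptions.
import numpy as np
import ray
from vllm import LLM, SamplingParams


class DistillationPredictor:
    """Runs a teacher model over batches of prompts on a GPU worker."""

    def __init__(self):
        # Hypothetical teacher model used to generate distillation targets.
        self.llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
        self.sampling = SamplingParams(temperature=0.7, max_tokens=512)

    def __call__(self, batch: dict) -> dict:
        outputs = self.llm.generate([str(p) for p in batch["prompt"]], self.sampling)
        batch["response"] = np.array([o.outputs[0].text for o in outputs])
        return batch


# CPU-bound reading/preprocessing and GPU-bound inference scale independently.
ds = ray.data.read_parquet("s3://bucket/prompts/")  # CPU stage
ds = ds.map_batches(
    DistillationPredictor,
    concurrency=4,   # number of vLLM actor replicas
    num_gpus=1,      # one GPU reserved per replica
    batch_size=64,
)
ds.write_parquet("s3://bucket/distilled/")  # training-ready distillation data
```

Keeping the CPU-bound read/preprocess stage separate from the GPU-bound inference stage is also what enables the heterogeneous scaling in use case 1: many CPU tasks can run in parallel while only the inference actors reserve GPUs.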