RAY SUMMIT 2025
IN-PERSON AGENDA
08:30 AM - 09:30 AM | Rayground
Breakfast + Networking
Enjoy a light breakfast and coffee while mingling with Ray attendees
09:30 AM - 11:30 AM | Keynote Session
Ray Summit Keynote - Day 1
Ray Summit is where AI engineers, researchers, and open-source contributors come together to shape the next generation of scalable AI systems. This year’s opening keynote dives into three themes: the evolution of the Ray ecosystem and its growing community, the expanding role of open source in defining the modern AI stack – from Kubernetes to Ray, PyTorch, and vLLM – and new capabilities in the Anyscale platform designed to help teams move faster towards production AI. Hear directly from leaders across Meta, NVIDIA, Thinking Machines, UC Berkeley, Azure and Anyscale as they share how open systems are redefining how we build and deploy AI at scale.
Day 1 Keynote Speakers:
Jim Fan, NVIDIA
Devendra Chaplot, Thinking Machines
Brendan Burns, Microsoft
Joe Spisak, Meta
Simon Mo, vLLM
Dawn Chen, Google
Ion Stoica, Anyscale
Robert Nishihara, Anyscale
Keerti Melkote, Anyscale
11:30 AM - 01:00 PM | Rayground
Lunch + Networking
Grab lunch and explore Rayground, where you'll find Anyscale demos, sponsor booths, and the Lightning Theater.
11:45 AM - 12:00 PM | Lightning Theater
Agentic AI
Text / Docs
Ray Agent Engine: Agent Deployment using Ray Serve
The rise of sophisticated AI agents is driving a new wave of innovation, but deploying these agents reliably and at scale presents significant challenges. This talk explores how Ray Serve can be leveraged to deploy AI agents in a framework-agnostic manner, enabling seamless integration with various agent architectures and development workflows.
The speakers will share lessons learned from using Ray as an agent engine, demonstrating the capabilities of Ray Serve – including its built-in autoscaling and traffic management – to optimize performance, ensure robustness, and simplify the management of complex agent deployments. Attendees will gain insights into building scalable, resilient agent-powered applications with Ray Serve, regardless of the underlying framework or agent design.
12:00 PM - 12:15 PM | Lightning Theater
Reinforcement Learning
Text / Docs
Structured Data
Personalize & Optimize at Scale: Ray-Based Adaptive Experimentation for RL Policies
Grab, Southeast Asia's leading superapp, is transforming real-world decision-making through an advanced Reinforcement Learning (RL) platform, powered by Ray. This scalable distributed computing framework streamlines the entire RL lifecycle, from intricate model training to seamless deployment, enabling Grab to deliver highly localized and adaptive digital experiences across its diverse markets. Key applications include Dynamic Pricing Models, Personalized Ads and Recommendations, and sophisticated Demand Forecasting.
The Imperative for Scalable RL at Grab:
Grab operates in a profoundly dynamic and complex environment, serving millions of users across numerous countries. Each region presents unique customer behaviors, regulatory landscapes, and real-time variables such as peak traffic, sudden weather changes (e.g., heavy rainfall in Singapore), and fluctuating supply-demand dynamics.
To provide an optimal, hyper-relevant experience, Grab's machine learning models must be:
1. Hyper-localized: precisely tuned to granular, real-time local conditions.
2. Adaptive in real time: capable of responding instantaneously to evolving circumstances.
3. Diverse: able to concurrently deploy and rigorously evaluate a wide array of specialized models.
Effectively managing this inherent complexity necessitates a highly robust, flexible, and fundamentally scalable infrastructure. Ray, with its comprehensive suite of libraries, proves indispensable to Grab's operational excellence.
Ray's End-to-End Impact: From Training to Deployment
Ray empowers Grab to execute sophisticated RL experiments and seamlessly transition them into production, optimizing every stage of the model lifecycle:
1. Accelerated Training with Ray RLlib & Ray Tune
Grab's RL training pipeline leverages Ray RLlib and PyTorch Lightning for unparalleled efficiency:
Exceptional Scalability: Ray's distributed runtime scales RL training across multiple nodes and GPUs, facilitating rapid iteration on vast datasets and complex simulation environments.
Comprehensive Algorithm Suite & Customization: RLlib offers a rich, production-ready collection of RL algorithms, enabling rapid implementation and experimentation. Its modular design allows seamless integration of custom models, environments, and metrics, tailoring solutions to unique business challenges.
Automated Optimization: Ray Tune is instrumental in efficient experiment management. It automates hyperparameter tuning, checkpoint management, and meticulous tracking of experiment results, consistently leading to higher-performing models with reduced manual effort.
2. Seamless Deployment & Online Evaluation with Ray Serve
The ability to deploy and rigorously evaluate models in real time is paramount for Grab's adaptive services. Ray Serve plays a pivotal role in this phase:
Parallel Model Serving: Ray Serve enables Grab to run and manage multiple model versions concurrently, vital for A/B testing and continuous improvement without disrupting live services.
Contextual Traffic Routing: User traffic is intelligently routed to the most appropriate model based on immediate context. For instance, a pricing model optimized for rush hour or a demand-supply matching policy adjusted for heavy rain can be dynamically served, ensuring an optimal user experience.
Scaled Operationalization: Ray Serve provides the robust infrastructure to operationalize these highly localized, context-aware model variations at immense scale, ensuring sustained speed, reliability, and adaptability across Grab's operations.
By abstracting away the inherent complexities of building scalable RL pipelines, Ray empowers Grab's data scientists and engineers to concentrate efforts on core innovation and rapid deployment. This strategic approach not only dramatically accelerates development across Grab's AI systems but also rigorously upholds best practices in reproducibility, modularity, and scalability, solidifying Grab's commitment to delivering cutting-edge, adaptive services.
12:15 PM - 12:30 PM | Lightning Theater
Machine Learning
SQL or Python? Wrong Question for the AI Era
The lines between data and model infrastructure are blurring. As GenAI applications push for tighter integration between training, retrieval, transformation, and serving, the same distributed compute patterns are showing up everywhere. This talk explores whether Pythonic data frameworks and SQL engines are beginning to converge, and explores which ideas each brings to the table as the next generation of data systems takes shape.
12:30 PM - 12:45 PM | Lightning Theater
LLMs
Orchestrating the GenAI Lifecycle with KubeRay: Training, Inference and Benchmarking
Managing the lifecycle of GenAI models from training and fine-tuning to inference and benchmarking often becomes a messy mix of scripts, containers, and manual setup. This session will show how KubeRay streamlines it all by acting as a unified, scalable layer for managing distributed model workflows on Kubernetes.
We’ll showcase real-world examples of using Ray and KubeRay to orchestrate grid search experiments, automate LoRA fine-tuning and model containerisation, and parallelize large-scale benchmarking workloads. You’ll see how KubeRay enables fast orchestration of reproducible experiments and how we’ve built environment-agnostic workflows using Ray that integrate seamlessly into any Kubernetes cluster.
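For readers unfamiliar with KubeRay, a workflow like the ones above is typically declared as a RayJob custom resource. The manifest below is a hypothetical sketch, not the speakers' setup: the job name, entrypoint script, image tag, and resource values are placeholders.

```yaml
# Hypothetical RayJob manifest of the kind KubeRay consumes.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: lora-finetune-sweep        # placeholder name
spec:
  entrypoint: python finetune.py   # placeholder script
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0
                resources:
                  limits:
                    nvidia.com/gpu: 1
```

KubeRay provisions the Ray cluster, runs the entrypoint, and tears the cluster down when the job finishes, which is what makes the experiments reproducible on any Kubernetes cluster.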
12:45 PM - 01:00 PM | Lightning Theater
Machine Learning
Intel® Xeon® 6 Processors for Efficient, Performant AI Inferencing
AI Inference is experiencing unprecedented growth, significantly outpacing training workloads in enterprise deployments. This talk showcases Intel® Xeon® 6 processors as a robust and cost-effective solution for AI inference on small to medium language models, targeting production-critical use cases, including RAG applications, intelligent chatbots, automated content creation, document summarization, and emerging agentic workflows, where CPU-based inference delivers optimal performance at specific pipeline stages.
We'll explore Intel Xeon® 6's AI-focused hardware enhancements, including built-in AMX (Advanced Matrix Extensions) that unlock significant performance gains, enhanced memory bandwidth (MRDIMMs), multi-core parallelism, and confidential computing (Intel TDX) that provide reliable, secure, and low-latency execution for enterprise AI workloads. To accelerate time-to-production, we'll briefly cover streamlined deployment through vLLM integration and comprehensive solutions from Intel® AI for Enterprise Inference and Intel® AI Enterprise RAG. These turnkey solutions allow customers and developers to focus on application development rather than complex infrastructure configuration.
A key highlight of this talk will cover recent advancements in production deployment acceleration, showcasing how Anyscale's Ray amplifies this capability: orchestrating distributed inference jobs seamlessly across hybrid infrastructure, optimizing resource allocation, and enabling dynamic scaling for variable AI traffic. Together, Xeon and Ray form a unified compute fabric that democratizes AI inference: accessible, scalable, and resilient across industries.
01:00 PM - 01:30 PM | Golden Gate A
Machine Learning
Image
Video
Scaling Image and Video Processing with Ray
xAI is building the world’s most powerful AI models to advance human comprehension and capabilities. Multimodal data plays a central role in this effort. To meet the extreme demands of large-scale multimodal training, we have developed a high-performance data processing stack powered by Ray Core and KubeRay. This system enables efficient distributed processing of image and video data with linear scalability, and robust fault tolerance in production environments.
In this talk, we will present the architecture of our Ray-based data pipeline built on top of KubeRay, and strategies for achieving high availability and operational simplicity at supercluster scale.
01:00 PM - 01:30 PM | Golden Gate B
Reinforcement Learning
Physical AI
Image
Structured Data
Video
Ray at Applied Intuition: Scaling Batch Inference and RL
Applied Intuition uses Ray to scale large-scale inference and reinforcement learning workloads operating on petabytes of raw sensor data for autonomous driving. In this talk, we will first cover Ray’s role within Applied’s ML infrastructure and how it enables unified, distributed execution across Kubernetes clusters. We’ll then discuss how Ray Data powers large-scale batch inference pipelines, streaming sensor data from our lake, performing CPU-intensive transformations, and seamlessly feeding into GPU inference at scale. Finally, we’ll dive into how Ray’s distributed execution model and RLlib enable scalable open- and closed-loop reinforcement learning—running thousands of parallel rollouts, colocating GPU learners with simulators, and recovering full state efficiently during training. We’ll also share our experience managing Ray infrastructure in production and practical tips for applying Ray to inference and reinforcement learning workloads.
01:00 PM - 01:30 PM | Golden Gate C1
LLMs
Media & Gaming
Text / Docs
Scaling LLM Post-Training at Character.AI
Character.AI is the world's leading application for AI entertainment, serving tens of millions of users per day with large language models (LLMs). To continuously improve the models that power our AI Characters, we have built a robust, scalable post-training stack entirely on open-source technologies in the Ray ecosystem. Our fine-tuning stack, internally named Rayman, has let us accelerate model development velocity and large MoE model training efficiency. We also utilize and adapt open-source RL libraries (verl) to address our unique challenges in RL training. In this talk, we will detail the architecture of Rayman, the open-source projects we leverage, our RL framework, and the ML challenges we've overcome.
Specifically, we will cover:
1. Infrastructure for Fine-Tuning and Distillation: We will introduce Rayman, our internal framework built on Ray Data, Ray Train, and DeepSpeed/PyTorch FSDP/an internal pipeline SFT system for orchestrating all distributed workloads. We'll detail how we use this for large-scale SFT and DPO, including our strategy for training massive Mixture-of-Experts (MoE) models like those from DeepSeek. We will also cover our approach to knowledge distillation of state-of-the-art open-source LLMs into smaller, more tractable models.
2. Reinforcement Learning from Real User Feedback: A core challenge in aligning models for open-ended creative dialogue is that there are no verifiable rewards. We will discuss how we tackle this by training our own reward models on real user interaction data, which we then use for RL. We'll detail our RL framework built on top of verl, which lets us translate noisy, real-world user feedback into a clear signal that can be effectively "hill-climbed" using a variety of reinforcement learning techniques to significantly improve the quality of our models.
01:00 PM - 01:30 PM
Golden Gate C2
vLLM
Text / Docs
State of vLLM 2025
In this talk, we will cover the latest one year in review for the vLLM project and discuss the road ahead.
01:00 PM - 01:30 PM
Golden Gate C3
Machine Learning
Text / Docs
Image
Structured Data
Accelerating AI Pipelines with AnalyticDB Ray: Alibaba Cloud's Approach to Data-AI Convergence
Abstract: In the era of data-driven innovation, efficiently processing and analyzing multi-modal data is crucial for building effective AI pipelines. This presentation will focus on real-world applications of Alibaba Cloud's AnalyticDB Ray, showcasing how its capabilities are leveraged within a data warehouse environment to accelerate AI initiatives.
We will delve into practical use cases that demonstrate the seamless integration of multi-modal ETL and machine learning:
1. Optimizing Advertising Recommendation Inference: Learn how AnalyticDB Ray is used for offline batch inference in advertising recommendations, specifically for estimating click-through rates (CTR). This includes details on how heterogeneous resources (CPU and GPU) are independently and automatically scaled, raising GPU utilization from under 5% to 40%. We will also discuss the dynamic auto-scaling of object storage based on data volume, which has improved data processing performance by 2 to 3 times.
2. Accelerating Large Language Model (LLM) Offline Batch Inference and Data Distillation: Discover how AnalyticDB Ray facilitates large model data preparation. We will illustrate the use of Ray Data and vLLM/SGLang for data distillation with models like Qwen and Deepseek, which then fuels large model training. Key benefits include a 2-3x improvement in data loading throughput due to caching, scheduling of 40,000 fine-grained tasks within a single Ray cluster, and a 50% performance increase for Deepseek INT8 quantization compared to FP8 in offline distillation scenarios.
3. Efficient Distributed Fine-tuning of Multi-modal Models: Explore how AnalyticDB Ray, integrated with Lance, enhances distributed image-text data processing and structuring using Ray Data for multi-modal personalized interactive scenarios. We will also showcase the integration with LLaMA-Factory to provide distributed fine-tuning capabilities for Qwen-VL multi-modal models. This offers a one-stop solution from data labeling to model fine-tuning and has improved distributed fine-tuning efficiency by 3-5 times.
These examples will illustrate how AnalyticDB Ray unlocks the potential for in-warehouse AI pipelines, seamlessly integrating multi-modal ETL and machine learning to accelerate the journey from data to intelligent decision-making.
01:00 PM - 01:30 PM
Yerba Buena 2-3
Reinforcement Learning
Scaling Reinforcement Learning with resiliency, elasticity, and efficiency across thousands of GPUs
As RL expands from gaming and robotics to LLM alignment and real-world control with trillion-parameter base models, robustness, scalability, and elasticity matter as much as raw speed. Many RL pipelines crumble at cluster scale not for lack of GPUs, but because GPU failures, preemptions, and tail latencies erode goodput, the amount of useful learning per GPU-hour. Ray offers a unified programming model that allows users to seamlessly scale applications from a single machine to a distributed cluster, offering unparalleled developer velocity. Popular RL frameworks like Verl use Ray to spin up workers and move data across distributed nodes.
Amazon SageMaker HyperPod offers a persistent GPU cluster optimized for scaling distributed AI. Combining the efficiency of Ray with the resiliency of HyperPod offers a seamless, scalable solution for large-scale post-training workloads. In this session we demonstrate how to build a fault-tolerant RL factory for LLM alignment with PPO/GRPO/DAPO by combining open-source frameworks such as Verl with Amazon SageMaker HyperPod. We will dive deeper into the architectural details of running Ray Jobs on HyperPod over clusters of thousands of GPUs and leveraging vLLM for inference workers. We will share reference examples for post-training of large open-weight models while optimizing performance and goodput.
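The goodput metric the abstract centers on can be made concrete with a small calculation. This is a generic, hedged sketch of the definition (useful GPU time over provisioned GPU time), not a SageMaker HyperPod or Ray API:

```python
def goodput(useful_gpu_hours: float, total_gpu_hours: float) -> float:
    """Fraction of provisioned GPU time that produced useful learning.
    Time lost to failures, restarts, and idle waits counts against it."""
    if total_gpu_hours <= 0:
        raise ValueError("total_gpu_hours must be positive")
    return useful_gpu_hours / total_gpu_hours

# 1,000 GPUs for 24 h, with 3 h lost cluster-wide to a failure and
# checkpoint restore:
total = 1000 * 24.0
useful = 1000 * (24.0 - 3.0)
print(f"goodput = {goodput(useful, total):.2%}")  # goodput = 87.50%
```

The point of fault-tolerance machinery (fast failure detection, elastic restarts, frequent checkpoints) is to shrink the gap between useful and total GPU-hours.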
01:00 PM - 01:30 PM
Yerba Buena Salon 4-6
Ray Deep Dives
Ray: Last Year's Progress and the Road Ahead
This session will highlight what we've achieved this year to improve performance, resiliency, and observability for the cutting-edge applications that run on Ray. Highlights include Ray Direct Transport to optimize data transfer between accelerators, native resource isolation with cgroups, increased resiliency to network failures, and improved observability at scale. We will also preview the roadmap for Ray Core in 2026.
01:00 PM - 01:30 PM
Yerba Buena Salon 10-12
Reinforcement Learning
Terminal-Bench: an open-source benchmark for language models as agents in realistic terminal environments.
AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench: a carefully curated hard benchmark composed of tasks in computer terminal environments inspired by problems from real workflows. We will also share reflections on progress on Terminal-Bench, details on TB 2.0, and our vision for unifying agent evals and training under a new open-source framework.
01:30 PM - 01:45 PM
Lightning Theater
Lightning Talk
Text / Docs
Horizontal, Predictable, High-Throughput Inference for Synthetic Data Generation, Evals, and More
Sutro (https://sutro.sh/) is an accelerated batch inference service. We use vLLM under the hood to power offline inference workloads ranging from a few hundred to tens of billions of tokens, often for synthetic data generation, evals, or processing unstructured data. It's critical for us to be able to use vLLM in a predictable way - from a cost, performance, and transparency standpoint. In this talk we'll explain how we use vLLM under the hood, from our custom implementation, to our performance profiler, throughput estimation algorithms, and cost attribution instrumentation. This talk is geared towards teams looking to push the boundaries of what's possible with vLLM at scale.
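The throughput-estimation and cost-attribution ideas mentioned above reduce, in their simplest form, to projecting job time from measured tokens-per-second. The function below is an illustrative assumption (linear scaling across GPUs, made-up numbers), not Sutro's actual algorithm:

```python
def estimate_batch_job(total_tokens: int,
                       tokens_per_sec_per_gpu: float,
                       num_gpus: int,
                       gpu_hour_cost: float) -> dict:
    """Estimate wall-clock time and cost for an offline batch inference
    job, assuming throughput scales linearly across GPUs (an idealization;
    real profiles account for batching, prefill/decode mix, and stragglers)."""
    seconds = total_tokens / (tokens_per_sec_per_gpu * num_gpus)
    hours = seconds / 3600.0
    return {"hours": hours, "cost": hours * num_gpus * gpu_hour_cost}

# 10B tokens at 5,000 tok/s per GPU on 32 GPUs at $2.50/GPU-hour:
est = estimate_batch_job(10_000_000_000, 5000.0, 32, 2.50)
print(f"{est['hours']:.1f} h, ${est['cost']:.0f}")  # 17.4 h, $1389
```

Predictability comes from replacing the assumed tokens-per-second figure with one measured by a profiler on the actual model and hardware.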
01:45 PM - 02:15 PM
Golden Gate A
Machine Learning
Structured Data
Building RayLab: Autodesk’s Journey to Scalable Deep Learning Infrastructure
In this presentation, we describe Autodesk's journey to enabling large-scale deep learning across the company. We began by exploring managed solutions like AWS Batch and SageMaker, but quickly ran into challenges around scalability, customization, networking, and developer experience. To overcome these limitations, we turned to Ray and KubeRay, which offered the flexibility and control we needed. Building on top of these technologies, we developed RayLab - Autodesk's internal platform for scalable training, data processing, and model serving.
RayLab is a Kubernetes-native platform that abstracts away infrastructure complexity while supporting secure, efficient, and user-friendly workflows. We built wrappers for Ray cluster management via a CLI, Web UI, and Python SDK to simplify usage, reduce onboarding friction, and ensure compliance with Autodesk's internal security and networking requirements. We'll describe the architecture behind RayLab, which includes Kubernetes, KubeRay, Karpenter, Grafana, and JupyterHub - all secured with role-based access control and designed for multi-tenancy to support ML workspaces across teams.
RayLab provides high-level APIs built on Ray and PyTorch Lightning, allowing users to launch distributed training jobs with minimal code. It includes standardized checkpointing and experiment tracking to support reproducibility and consistent workflows.
We'll also share the challenges we faced in improving the efficiency of our reserved H100 GPU resources, particularly around fair sharing across teams and users. To address this, we implemented quota management and a priority-based scheduling system that enables high-priority jobs to preempt lower-priority ones, significantly increasing utilization. Additionally, RayLab supports automatic downscaling of underutilized clusters to conserve compute.
Finally, we'll conclude with a live demo of RayLab in action.
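The priority-based preemption described above can be sketched with a toy in-memory scheduler. This is a hypothetical, stdlib-only illustration of the policy (higher-priority jobs evict lower-priority ones when GPUs run out); RayLab's real implementation sits on Kubernetes, KubeRay, and Karpenter:

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    priority: int  # higher preempts lower
    gpus: int

@dataclass
class Scheduler:
    capacity: int
    running: list = field(default_factory=list)

    def used(self) -> int:
        return sum(j.gpus for j in self.running)

    def submit(self, job: Job) -> list:
        """Admit `job`, preempting the lowest-priority strictly-lower
        jobs as needed. Returns the names of preempted jobs."""
        preempted = []
        victims = sorted((j for j in self.running if j.priority < job.priority),
                         key=lambda j: j.priority)
        for v in victims:
            if self.used() + job.gpus <= self.capacity:
                break
            self.running.remove(v)
            preempted.append(v.name)
        if self.used() + job.gpus <= self.capacity:
            self.running.append(job)
        return preempted

sched = Scheduler(capacity=8)
sched.submit(Job("batch-eval", priority=1, gpus=6))
print(sched.submit(Job("prod-train", priority=9, gpus=4)))  # ['batch-eval']
```

Quota management adds a second constraint (per-team GPU budgets) on top of the same admit-or-preempt decision.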
01:45 PM - 02:15 PM
Golden Gate B
LLMs
Finance
Text / Docs
Structured Data
Operationalizing Ray for Real-Time Inference at Scale
The KubeRay project simplifies the deployment of Ray clusters on Kubernetes. However, hardening these clusters to meet stringent production non-functional requirements (NFRs) typical of regulated environments such as those in finance requires additional engineering effort. These NFRs include multi-region and multi-AZ resilience, deployments with zero downtime, proactive autoscaling, automated production validation testing, and robust integration with monitoring and logging systems. Achieving this requires strict adherence to integration patterns and architectural discipline. This work outlines the engineering patterns and platform enhancements we applied to deploy a real-time recommendation system on a Ray cluster on EKS in production.
The system architecture routes traffic through Route53 to NLB, ALB, and finally nginx ingress to reach Ray Serve deployments running in isolated Kubernetes namespaces. Ray Serve’s proxy actors distributed across worker pods act as distributed ingress routers for balancing traffic across replicas, buffering during surges, and enabling concurrent processing without blocking. Inference workloads run on dedicated Ray clusters on separate Kubernetes namespaces managed by the centralized KubeRay Operator. Compute and network isolation is ensured by network policies and RBAC.
Autoscaling is handled at three levels: Ray Serve scales replicas based on request queue depth, KubeRay Autoscaler adjusts Ray worker pod counts based on cluster metrics, and the AWS Cluster Autoscaler provisions EC2 instances based on pending pods awaiting compute resources. This ensures responsiveness during traffic spikes while avoiding over-provisioning.
To maximize GPU utilization and reduce latency, the platform leverages Ray Serve’s dynamic batching in combination with vLLM to perform batched LLM inference. This approach ensures high-throughput, low-latency processing, especially under variable request loads, by grouping requests at runtime based on traffic characteristics.
Observability is achieved through an integrated Prometheus and Grafana stack. A PodMonitor scrapes metrics from Ray components, which are then ingested by Prometheus and visualized in Grafana for real-time analysis and alerting. In parallel, a Fluentd DaemonSet captures logs from Ray and application pods, forwarding them to AWS CloudWatch. These logs are then ingested into Splunk for centralized search, monitoring, and audit compliance. The apps are also monitored at cluster level by Dynatrace and Datadog to further enhance observability and monitoring capabilities.
To enable robust and disruption-free deployments, the platform uses a blue-green deployment pipeline built with Spinnaker. The pipeline includes progressive rollout stages, automated validation, manual approval gates, and rollback paths.
This robust system demonstrates a scalable, resilient, and observability-driven approach to deploying real-time inference and LLM workloads on Ray. Attendees will gain valuable insights into end-to-end development and deployment architecture, GPU workload optimization, and operationalization of Ray Serve in production environments.
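The dynamic batching behavior described in this abstract follows a size-or-timeout policy: flush a batch when it is full or when the oldest request has waited too long. The sketch below is a generic simulation of that policy, not Ray Serve's internals (Ray Serve exposes the real mechanism through its `@serve.batch` decorator):

```python
def form_batches(arrivals, max_batch_size, max_wait):
    """Group request arrival times (seconds) into batches that flush
    when max_batch_size is reached or the oldest request has waited
    more than max_wait seconds. Pure simulation, stdlib only."""
    batches, current = [], []
    for t in arrivals:
        if current and (len(current) == max_batch_size
                        or t - current[0] > max_wait):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Bursty traffic: four quick requests, then a straggler ~100 ms later.
arrivals = [0.000, 0.001, 0.002, 0.003, 0.100]
print(form_batches(arrivals, max_batch_size=4, max_wait=0.010))
# two batches: the burst of four, then the straggler alone
```

Tuning the two knobs trades latency (small batches, short waits) against GPU throughput (large batches), which is why batching is shaped at runtime from observed traffic.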
01:45 PM - 02:15 PM
Golden Gate C1
Machine Learning
Text / Docs
Structured Data
Revolutionizing Model Serving with a 50x Cost Reduction using Ray Serve at Workday
Workday uses a tenanted, regionalized architecture in order to ensure data isolation and in-region execution, both of which are crucial requirements for our customers. In early 2023, facing challenges with the ever-increasing scale and cost required to serve dedicated ML models for every tenant in every environment, we decided to completely redo how we serve models using a hot new technology: Ray! We now use Ray Serve to serve tens of thousands of ML models across more than a dozen environments. Ray Serve’s inherent capabilities of per-deployment autoscaling and efficient request routing have enabled 50x cost reductions compared to our previous systems while maintaining high availability and low latency.
But it wasn’t all smooth sailing: Ray Serve was not originally designed for some of the odd ways our system works, and we ran into scaling problems with only a few dozen applications deployed. Thankfully, Ray is open-source, and we were able to dramatically improve the scalability of Ray Serve through a series of contributions we made back to the community. Today, Ray Serve can handle thousands of applications per cluster.
In this talk, we’ll discuss how we use Ray Serve to implement model serving at Workday. We’ll dive deep into our off-the-beaten-path usage of Ray Serve, the challenges we faced with scaling up our system, and the improvements we’ve contributed back to Ray to solve them. You’ll come away from this talk with a deeper understanding of how Ray Serve works and how to build complex systems on top of it – and maybe even contribute back to it!
Read more
Workday uses a tenanted, regionalized architecture in order to ensure data isolation and in-region execution, both of which are crucial requirements for our customers. In early 2023, facing challenges with the ever-increasing scale and cost required to serve dedicated ML models for every tenant in every environment, we decided to completely redo how we serve models using a hot new technology: Ray! We now use Ray Serve to serve tens of thousands of ML models across more than a dozen environments. Ray Serve’s inherent capabilities of per-deployment autoscaling and efficient request routing have enabled 50x cost reductions compared to our previous systems while maintaining high availability and low latency.
But it wasn’t all smooth sailing: Ray Serve was not originally designed for some of the odd ways our system works, and we ran into scaling problems with only a few dozen applications deployed. Thankfully, Ray is open-source, and we were able to dramatically improve the scalability of Ray Serve through a series of contributions we made back to the community. Today, Ray Serve can handle thousands of applications per cluster.
In this talk, we’ll discuss how we use Ray Serve to implement model serving at Workday. We’ll dive deep into our off-the-beaten-path usage of Ray Serve, the challenges we faced with scaling up our system, and the improvements we’ve contributed back to Ray to solve them. You’ll come away from this talk with a deeper understanding of how Ray Serve works and how to build complex systems on top of it – and maybe even contribute back to it!
But it wasn’t all smooth sailing: Ray Serve was not originally designed for some of the odd ways our system works, and we ran into scaling problems with only a few dozen applications deployed. Thankfully, Ray is open-source, and we were able to dramatically improve the scalability of Ray Serve through a series of contributions we made back to the community. Today, Ray Serve can handle thousands of applications per cluster.
In this talk, we’ll discuss how we use Ray Serve to implement model serving at Workday. We’ll dive deep into our off-the-beaten-path usage of Ray Serve, the challenges we faced with scaling up our system, and the improvements we’ve contributed back to Ray to solve them. You’ll come away from this talk with a deeper understanding of how Ray Serve works and how to build complex systems on top of it – and maybe even contribute back to it!
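The per-deployment autoscaling the abstract credits for the cost savings can be sketched as a toy queue-depth rule: scale each deployment independently based on its in-flight requests. This is a simplification with invented names, not Ray Serve's actual autoscaler, which considers more signals.

```python
import math

def desired_replicas(ongoing_requests: int,
                     target_per_replica: int,
                     min_replicas: int = 0,
                     max_replicas: int = 10) -> int:
    """Toy queue-depth autoscaling rule: size the deployment so each
    replica handles roughly `target_per_replica` in-flight requests."""
    if ongoing_requests <= 0:
        return min_replicas
    wanted = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

# An idle tenant scales to zero; a busy one scales out, capped by max_replicas.
assert desired_replicas(0, target_per_replica=4) == 0
assert desired_replicas(10, target_per_replica=4) == 3
assert desired_replicas(100, target_per_replica=4) == 10
```

Scaling idle tenants to zero is what makes serving tens of thousands of models affordable: only the deployments with traffic hold onto resources.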
01:45 PM - 02:15 PMGolden Gate C2
vLLM
Text / Docs
FlashInfer: Accelerating LLM Inference Through Unified High-Performance Kernels
As large language models evolve rapidly, the inference ecosystem faces a critical challenge: delivering optimized kernels that maximize hardware efficiency while keeping pace with emerging model architectures. This talk introduces FlashInfer, NVIDIA's strategic initiative to centralize inference kernel development and distribution, addressing the ecosystem's need for performance specialization and development agility.
FlashInfer has demonstrated significant market adoption by leading frameworks including vLLM. Built on open-source principles, FlashInfer fosters community collaboration while maintaining NVIDIA's performance leadership, enabling rapid innovation through transparent development and shared contributions. This session explores how NVIDIA is formalizing FlashInfer as our primary inference kernel library, creating a collaborative ecosystem benefiting internal teams and the broader AI community.
01:45 PM - 02:15 PMGolden Gate C3
Machine Learning
Finance
Structured Data
Building a Model Fitting Framework for Quant Finance with Ray & Anyscale
Quant trading and research teams at Point72/Cubist have diverse needs related to data, models, and their specific use cases. Investing in an on-premises Ray cluster enabled Ray-focused approaches, but adoption has not always been seamless. Challenges emerged around data management (loading, reuse, access), scaling (efficiently performing parallel windowed model training, sometimes on tens of terabytes of time-series data), and platform usage (determining how and when to utilize an Anyscale cluster).
The Cubist Core Research Technologies team has developed libraries to tackle these challenges. We'll discuss patterns we found to be successful, those that fell short, and share insights that we believe can be widely applied.
01:45 PM - 02:15 PMYerba Buena 2-3
Machine Learning
Right-Sized Ray: A Pragmatic Guide to Scaling on Kubernetes for Teams of All Sizes
Are you battling unschedulable pods because a rogue ML job consumed all your cluster's resources? Are idle GPUs burning a hole in your cloud bill? Or are your developers stuck in a YAML swamp just to scale a Python script? It’s time to stop treating AI infrastructure as a series of hacks and start building a stable, efficient, and self-service platform.
Manual deployments are brittle and hard to scale. KubeRay bridges this gap. It introduces a RayCluster resource that teaches Kubernetes how to manage the entire lifecycle—from creation and fault tolerance to scaling and cleanup—as a single, cohesive unit.
This session is the definitive playbook for leveraging KubeRay, whether you're supporting two developers or two hundred. We will show you how to transform Kubernetes from a general-purpose container platform into an intelligent, application-aware backend for distributed AI.
You will leave this talk knowing how to:
Tame the "Noisy Neighbor" Problem: Implement true multi-tenancy with namespace isolation and cgroups to guarantee resource fairness.
Master Workload Placement: Use label-based scheduling within Ray to automatically direct jobs to the right Kubernetes nodes to optimize costs, without needing to become a Kubernetes expert.
Slash Cloud Costs with Smart Scaling: Configure IPPR with the Ray v2 autoscaler for faster pod scheduling that scales up as needed without requiring costly over-provisioning.
Best Practices for a Unified Platform: Recommended best practices to set up and manage multiple, purpose-built Ray clusters on a single Kubernetes control plane
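The label-based placement idea in the list above can be sketched as a simple match between a job's label requirements and each node's labels. This is an illustration with invented names, not KubeRay's or Ray's actual scheduling API.

```python
def pick_nodes(nodes: dict, required_labels: dict) -> list:
    """Return the names of nodes whose labels satisfy every required
    key/value pair, mimicking label-based placement of Ray workloads
    onto matching Kubernetes nodes."""
    return [
        name for name, labels in nodes.items()
        if all(labels.get(k) == v for k, v in required_labels.items())
    ]

nodes = {
    "gpu-node-1": {"accelerator": "a100", "pool": "training"},
    "gpu-node-2": {"accelerator": "l4", "pool": "inference"},
    "cpu-node-1": {"pool": "etl"},
}
# An inference job asks only for the accelerator type it needs.
assert pick_nodes(nodes, {"accelerator": "l4"}) == ["gpu-node-2"]
assert pick_nodes(nodes, {"pool": "training"}) == ["gpu-node-1"]
```

The point of pushing this matching into the scheduler is that job authors declare requirements once instead of hand-picking nodes per deployment.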
01:45 PM - 02:15 PMYerba Buena Salon 4-6
Ray Deep Dives
Advancing KubeRay: Deepening Ecosystem Integrations for Scalable AI Workloads
Improving user experience has been a central focus of the latest KubeRay developments. In this session, KubeRay maintainers will introduce major RayJob enhancements designed to make running and managing workloads simpler and more reliable, including deletion policies, cron scheduling, sidecar mode, and background status checks. These upgrades streamline the end-to-end job lifecycle for both developers and operators.
We will then explore new capabilities that elevate the KubeRay user experience across the Kubernetes ecosystem. Highlights include updates to the kubectl plugin for smoother user workflows, the redesigned APIServer V2 for a cleaner and more extensible control plane, and KubeRay Metrics to improve observability. We will also cover expanded support for third-party Kubernetes schedulers, along with progress on in-place pod resizing integration.
Finally, we will share the roadmap for the upcoming history server, which will enable debugging for dead Ray clusters. Join us to learn how these enhancements are transforming the KubeRay experience—making it more powerful, cohesive, and user-friendly for running AI workloads on Kubernetes at scale.
01:45 PM - 02:15 PMYerba Buena Salon 10-12
Ray Deep Dives
Ray Train: Distributed Solutions for Removing Training Bottlenecks
Maximizing GPU utilization is critical for accelerating deep learning workloads, yet many pipelines are limited by bottlenecks around the core training loop. Slow dataloaders, blocking validation steps, and GPU stalls from checkpointing all reduce training throughput. This talk demonstrates how Ray Train removes these barriers through features such as asynchronous checkpointing, async validation, scalable data ingestion with Ray Data, and mid-epoch dataset resumption. We’ll also showcase Ray Train’s observability and performance tooling, including the Train dashboard, profiling, metrics, and logs. Attendees will learn best practices for building high-throughput, production-grade training pipelines that leverage heterogeneous compute.
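The asynchronous-checkpointing idea above can be sketched with a plain thread and queue: the training loop hands snapshots to a background writer instead of blocking the GPU step on storage I/O. This is a toy stand-in with invented names, not Ray Train's implementation.

```python
import queue
import threading

def train_with_async_checkpoints(num_steps: int, save_fn) -> int:
    """Toy async checkpointing: snapshots are enqueued and written by a
    background thread while the training loop keeps running."""
    q = queue.Queue()

    def writer():
        while True:
            item = q.get()
            if item is None:        # sentinel: no more checkpoints
                break
            step, state = item
            save_fn(step, state)    # slow storage write, off the hot path

    t = threading.Thread(target=writer)
    t.start()
    state = 0
    for step in range(num_steps):
        state += 1                  # stand-in for one training step
        if step % 2 == 0:
            q.put((step, state))    # enqueue snapshot, keep training
    q.put(None)
    t.join()
    return state

saved_steps = []
assert train_with_async_checkpoints(4, lambda s, st: saved_steps.append(s)) == 4
assert saved_steps == [0, 2]
```

The same overlap trick applies to the other bottlenecks the talk lists: validation and data loading can also run concurrently with the training step rather than serialized against it.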
02:15 PM - 02:30 PMLightning Theater
vLLM
Text / Docs
Powering the Future of LLMs: AWS and the vLLM open source project
Amazon is a strong supporter of and contributor to vLLM, the leading open source inference engine for serving LLMs. vLLM is used across Amazon and enables millions of customers to use the Amazon Rufus shopping assistant. vLLM's support for heterogeneous hardware, including AWS Trainium and NVIDIA GPUs, has enabled deployment of a cost-optimized, multi-node inference architecture. This hybrid approach allows us to route requests to the most appropriate accelerator, leading to infrastructure cost savings without compromising performance. In this session, we'll dive into AWS deployment options with vLLM, our existing open source work streams, and other initiatives.
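The hybrid routing idea can be sketched as picking the cheapest backend that can handle a request. The backend names, costs, and the single feasibility criterion here are invented for illustration; a production router would weigh latency, load, and more.

```python
def route_request(prompt_tokens: int, backends: list) -> str:
    """Toy cost-aware router: among backends whose context window fits
    the request, pick the one with the lowest per-token cost."""
    feasible = [b for b in backends if b["max_tokens"] >= prompt_tokens]
    return min(feasible, key=lambda b: b["cost_per_1k_tokens"])["name"]

backends = [
    {"name": "trainium-pool", "cost_per_1k_tokens": 0.4, "max_tokens": 4096},
    {"name": "gpu-pool",      "cost_per_1k_tokens": 1.0, "max_tokens": 32768},
]
# Short prompts go to the cheaper pool; long ones fall back to the larger one.
assert route_request(2000, backends) == "trainium-pool"
assert route_request(16000, backends) == "gpu-pool"
```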
02:30 PM - 03:00 PMGolden Gate B
Machine Learning
Scaling RL @ Cursor
Cursor integrates AI into every step of the software creation process. At the heart of this are advanced coding models that power intelligent code completion, understanding, and generation at scale. In this talk, we’ll share our journey and describe the challenges of building our most advanced models.
02:30 PM - 03:00 PMGolden Gate C1
Reinforcement Learning
Scaling Open Distributed Infrastructure + Environments for Agentic RL
This talk surveys the design of key elements of the Prime Intellect infrastructure stack for distributed reinforcement learning, including prime-rl, verifiers, the Environments Hub, and the Prime Compute platform. prime-rl is our async-first RL trainer designed for large-scale distributed runs, including multi-cluster, fault-tolerant, and heterogeneous pools for inference (enabling e.g. spot compute to be used for rollout workers). prime-rl supports multi-turn environments built with verifiers, which is our library for implementing complex agentic protocols around an OpenAI-compatible API (allowing direct offline evaluation with any model endpoint). For large training runs such as our upcoming INTELLECT-3 model, we source environment implementations from the Environments Hub, which is our community platform for creating and sharing train-ready RL environments as importable Python packages. The Prime Compute platform, our multi-cloud compute marketplace, supports our whole stack top to bottom, from the clusters on which we run training and inference to the sandboxes required for complex agentic environments.
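The async-first decoupling described above can be sketched with a queue between rollout workers and the trainer: workers (which could run on preemptible spot capacity) push trajectories as they finish, and the trainer consumes them without lock-stepping with generation. This is a toy model with invented names, not prime-rl's actual code.

```python
import queue
import threading

def run_async_rl(num_rollouts: int, num_workers: int = 2) -> int:
    """Toy async RL loop: rollout workers produce trajectories into a
    shared queue; the trainer drains it independently."""
    trajectories = queue.Queue()

    def rollout_worker(worker_id: int):
        for episode in range(num_rollouts):
            # Stand-in for an environment rollout producing rewards.
            trajectories.put((worker_id, episode, [1.0, 1.0, 1.0]))

    workers = [threading.Thread(target=rollout_worker, args=(i,))
               for i in range(num_workers)]
    for w in workers:
        w.start()

    updates = 0
    for _ in range(num_workers * num_rollouts):
        _wid, _ep, _rewards = trajectories.get()
        updates += 1                # stand-in for one gradient update
    for w in workers:
        w.join()
    return updates

assert run_async_rl(3) == 6
```

Because the trainer only sees the queue, losing or adding a rollout worker changes throughput but not correctness, which is what makes spot compute usable for generation.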
02:30 PM - 03:00 PMGolden Gate C2
vLLM
Text / Docs
Scaling LLM Inference with RayServe & vLLM: Building a Serverless Internal, Enterprise Model Hosting Platform
In this talk, we'll share how we built an internal, enterprise, serverless model hosting platform using RayServe and vLLM—powering fast, scalable LLM inference across teams. Drawing inspiration from best-in-class industry solutions, our platform empowers users to deploy and manage models through a streamlined, self-service interface. We’ll dive into the key capabilities we’ve layered on top, including challenges and solutions around multi-tenancy, auto-scaling, token-level budgeting, request observability, and fine-grained resource controls. Whether you're building for internal developers or external customers, this session will show how RayServe and vLLM can be combined to deliver reliable, production-grade model inference at scale.
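Token-level budgeting, one of the capabilities listed above, can be sketched as a per-tenant meter that rejects requests once a quota is exhausted. The names and the hard-reject policy are illustrative assumptions; a real platform would likely add time windows and graceful degradation.

```python
class TokenBudget:
    """Toy per-tenant token budget for a shared model-serving platform."""

    def __init__(self, limit_tokens: int):
        self.limit = limit_tokens
        self.used = {}              # tenant -> tokens consumed

    def admit(self, tenant: str, tokens: int) -> bool:
        """Admit the request only if the tenant stays within budget."""
        spent = self.used.get(tenant, 0)
        if spent + tokens > self.limit:
            return False            # reject instead of overspending
        self.used[tenant] = spent + tokens
        return True

budget = TokenBudget(limit_tokens=1000)
assert budget.admit("team-a", 800)
assert not budget.admit("team-a", 300)   # would exceed team-a's budget
assert budget.admit("team-b", 300)       # other tenants are unaffected
```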
02:30 PM - 03:00 PMGolden Gate C3
Machine Learning
Image
Structured Data
Video
Scalable High-Performance Multi-Modal Data Curation with NVIDIA NeMo Curator
Processing petabyte-scale, multi-modal data for Generative AI, spanning text, video, audio, and more, is a complex distributed systems challenge. These pipelines require a framework capable of handling heterogeneous workloads, stateful operations like deduplication, and GPU acceleration. This session explores architectural patterns for building such pipelines using Ray.
Drawing on our experience building NVIDIA NeMo Curator - we demonstrate how Ray’s primitives enable efficient, scalable data processing. We will cover how to leverage Ray Actors for stateful, long-running tasks and Ray Tasks for stateless parallel transformations, managing heterogeneous CPU/GPU resources to maximize throughput and pipeline robustness.
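The stateful-vs-stateless split described above can be illustrated with plain Python standing in for Ray primitives: deduplication needs a long-lived worker that remembers what it has seen (an actor), while a normalization transform carries no state and can run anywhere (a task). Names here are invented for illustration.

```python
class Deduplicator:
    """Stateful worker (a Ray actor in the real pipeline): it must
    remember every document hash it has seen across batches."""

    def __init__(self):
        self.seen = set()

    def filter_new(self, docs):
        fresh = [d for d in docs if hash(d) not in self.seen]
        self.seen.update(hash(d) for d in fresh)
        return fresh

def normalize(doc: str) -> str:
    """Stateless transform (a Ray task): safe to fan out in parallel."""
    return doc.strip().lower()

dedup = Deduplicator()
batch1 = [normalize(d) for d in ["Hello ", "world"]]
batch2 = [normalize(d) for d in ["hello", "Ray"]]
assert dedup.filter_new(batch1) == ["hello", "world"]
assert dedup.filter_new(batch2) == ["ray"]   # "hello" was already seen
```

Keeping the stateful piece in one place while scaling the stateless transforms independently is what lets such pipelines use CPU and GPU resources efficiently.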
02:30 PM - 03:00 PMYerba Buena 2-3
Machine Learning
Media & Gaming
Text / Docs
Image
JIT-Embedding with Ray Serve: Accelerating Large-Scale GenAI Foundation Model Training in Adobe Firefly
This presentation introduces JIT-Embedding (Just-in-Time Embedding), a novel solution designed to accelerate the training of foundational Generative AI (GenAI) models, with a focus on image and video generation in Adobe Firefly. By decoupling the expensive embedding computation from model training, JIT-Embedding enables these processes to scale independently. Built on Ray Serve, our architecture includes a robust JIT Service and JIT Client, seamlessly integrated with our Model Hub and Dataloader. Experimental results demonstrate that this approach significantly improved scalability, enabled higher-resolution and larger-scale GenAI foundation model training, and achieved notable performance gains and cost reductions. It is one of the innovations contributing to the Firefly Video Model's public release.
JIT-Embedding addresses several key challenges in large-scale foundation diffusion model training:
1. Slow on-the-fly embedding computation during training (e.g., VAE, CLIP, and T5 embeddings).
2. Long turnaround time required for offline embedding pre-computation.
3. High cost associated with recomputing embeddings using either approach.
4. Severe GPU memory constraints when training large models or processing high-resolution images/videos.
Our solution introduces several innovations to mitigate these issues:
1. JIT Service via Ray Serve: Wraps embedding computation as an on-demand service, deployable on underutilized lower-tier GPUs (e.g., A100), freeing up high-end GPUs (H100) for model training and optimizing resource allocation. GPU memory requirements drop significantly on both sides.
2. JIT Client with Dataloader Integration: Uses multiprocessing and prefetching to overlap embedding requests with training, effectively hiding latency of on-the-fly embedding computation and maximizing GPU utilization.
3. Efficient Serialization/Deserialization: We created a Rust + Python library, inspired by functional programming, to efficiently compress multimodal data (e.g., images, videos, long text) and improve server–client communication throughput and flexibility.
4. Advanced Performance Optimization: Combines Ray Serve’s dashboards with our custom metrics, profiling, and load testing tools. We leverage advanced Ray features such as autoscaling, dynamic batching, and in-place model updates. Key optimizations include client-side load balancing, faster video/image codecs in Rust, overlapping CPU/GPU ops, and shared GPU usage across multiple models.
5. JIT Cache: Automatically stores computed embeddings for reuse across future training jobs, further reducing cost and computation time.
We plan to open source the JIT-Embedding solution, including the services, clients and Serialization/Deserialization library.
This talk will provide a comprehensive overview of the JIT-Embedding architecture, including the design of the JIT Service, JIT Client, Serialization/Deserialization, and the caching mechanism. We will present end-to-end experimental results from large-scale model training, showcasing the system’s scalability, performance enhancements, and cost efficiency. The session will conclude with key takeaways from our journey with Ray Serve and future directions for continued optimization.
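The JIT Cache idea (point 5 above) can be sketched as a content-addressed memo table: embeddings are keyed by a hash of the raw sample, so repeated samples across training jobs skip recomputation. This is an illustrative toy with invented names; the real system persists entries across jobs rather than in memory.

```python
import hashlib

class JITCache:
    """Toy embedding cache keyed by a content hash of the sample."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn    # the expensive model call
        self.store = {}
        self.hits = 0

    def get(self, sample: bytes):
        key = hashlib.sha256(sample).hexdigest()
        if key in self.store:
            self.hits += 1          # reuse: no GPU work needed
        else:
            self.store[key] = self.embed_fn(sample)
        return self.store[key]

calls = []
cache = JITCache(lambda s: (calls.append(s), [len(s)])[1])
assert cache.get(b"frame-001") == [9]
assert cache.get(b"frame-001") == [9]    # second lookup served from cache
assert len(calls) == 1 and cache.hits == 1
```

Because the key depends only on content, the cache stays valid across training jobs and dataloader shuffles, which is where the cost savings compound.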
02:30 PM - 03:00 PM | Yerba Buena Salon 4-6
Ray Deep Dives
Reliability at Scale: Fault-tolerant 10K+ Node Ray Clusters on the Anyscale Runtime
Today’s AI workloads push the boundaries of software infrastructure. At scale, network flakiness, spot preemptions, hardware failures, and resource contention become inevitable. We’ll discuss how Ray solves these problems, enabling applications to run reliably on clusters of 10,000+ nodes despite these challenges, and dig into the engineering lessons learned along the way. Finally, we’ll outline what’s next for Ray in terms of scalability and reliability to power tomorrow’s AI workloads.
02:30 PM - 03:00 PM | Yerba Buena Salon 10-12
Reinforcement Learning
Physical AI
Research
Structured Data
End‑to‑End Hybrid Reinforcement and Imitation Learning for Robotics with Ray
Machine learning in robotics demands complex abstractions over hardware and training/simulation layers to combine RL and IL (imitation learning). Policy learning for robotics rarely fits on one kind of machine: massive simulation parallelization with GPU physics and rendering in Isaac Lab demands RTX‑class GPUs, while policy training benefits from large VRAM and FLOPs. Over the past year we have built our infrastructure on Ray to hide this hardware/software diversity and let researchers focus on science, not sys‑admin.
Our platform offers:
- Unified orchestration – a single Ray workflow trains full-state RL models, uses them to train a multi-task IL policy, and runs evaluation in simulation.
- Heterogeneous GPU scheduling – placement groups assign Isaac Lab simulators to RTX workers and gradient computation to A100/H100 trainers without manual mapping.
- Isolated deployment targets – the same job definition that trains a policy can package it into a lightweight Ray Serve micro‑service that runs next to the robot or on a nearby edge server, shielding control code from research churn.
During the live demo we will:
- Launch a hybrid RL‑IL run that automatically provisions both Nvidia-RTX GPUs and A100/H100 nodes.
- Watch Ray adapt the cluster as workloads shift from simulation to learning to evaluation.
- Deploy the resulting policy to an isolated runtime on the robot—ready for immediate testing.
Attendees will leave with practical design patterns for juggling simulator‑heavy and large‑scale network training inside one reproducible Ray ecosystem, plus insights on meeting real‑time robotics constraints while remaining GPU‑efficient.
03:00 PM - 03:15 PM | Lightning Theater
Reinforcement Learning
Accelerating SFT, RL, and Inference in the Context Layer for Enterprise AI with Ray
Contextual AI builds enterprise-grade AI agents and applications. Using Ray, we’ve developed a scalable training and serving platform that accelerates supervised fine-tuning (SFT), reinforcement learning (RL), and low-latency inference across multi-node clusters.
In this talk, we’ll share our architecture and lessons learned: asynchronous RL pipelines, multi-turn training, LoRA-based adaptation, context/data/tensor parallelism, autoscaling and cold-start, latency-aware routing, and disaggregated prefill/decoding. We’ll also cover observability (logging, metrics, alerts), multi-host deployment, and reliability at scale, drawing on our experience building and operating enterprise AI agents on Ray, from training to production serving.
03:15 PM - 03:45 PM | Golden Gate A
Machine Learning
Physical AI
Video
Optimizing Video AI at Scale: Cost-Effective ML Operations with Geotab and Anyscale Ray
Processing and deriving intelligence from billions of frames of video data captured by Geotab cameras can be a resource-intensive task. This presentation will share Geotab's journey of building a cost-efficient and highly automated Smart Video Platform utilizing Anyscale Ray.
We will showcase how Ray serves as the backbone for hosting and orchestrating our machine learning models for video analysis, enabling both efficient real-time inference and batch processing.
A key focus will be on our automated training and validation workflows, which leverage Ray's distributed capabilities to dramatically reduce the time and cost associated with model development and deployment. Learn how Geotab is achieving significant operational savings and accelerating innovation in video analytics through a strategic embrace of Anyscale Ray.
03:15 PM - 03:45 PM | Golden Gate B
LLMs
Research
Text / Docs
Marin: Open Development of Open Foundation Models
Open-source software thrives because its entire lifecycle remains public: code, tests, even missteps. Foundation models rarely meet that bar: most “open-weight” releases omit the training code, data recipe, and logs needed for reproducibility.
Marin closes that gap. Every run begins as a GitHub pull request that defines the hypothesis and pins the config. Ray orchestrates the job across preemptible Google Cloud TPUs, streaming metrics and depositing artifacts tied exactly to the commit. Successes, failures, and restarts remain visible, not hidden.
With this workflow we trained Marin-8B, which outperforms Llama 3.1 8B Base on 14 of 19 benchmarks. We will share lessons from scaling to 32B parameters, training MoEs on preemptible hardware, and building a Ray-based RL pipeline for agentic models, focusing on autoscaling, fault tolerance, and dataset pipelines. We will also highlight ways to get involved, from optimizers and data curation to following training logs live.
03:15 PM - 03:45 PM | Golden Gate C1
Machine Learning
Structured Data
Breaking the Dataset Iteration Bottleneck: Real-Time ML Experimentation with Ray
At Pinterest, iterating on dataset curation and label generation consistently improves our recommendation models, but this process is severely constrained by expensive and time-consuming data generation workflows. When experimenting with new sampling strategies, features, or labels, teams face a critical choice: either backfill long-running jobs that strain compute resources and budget, or wait weeks for experimental datasets to naturally populate with new data. This creates a fundamental barrier to data-driven model improvement, where a single dataset iteration either costs thousands of dollars and requires tedious monitoring of the backfill process, or takes weeks of waiting. Either way, developer velocity suffers.
Two pivotal use-cases within Pinterest exemplify these challenges, namely the Dataset Sampling Strategy Exploration and the Generation of Labels for Downstream Engagement Modeling. Sampling is fundamental for creating training datasets from massive data repositories. Our sampling strategy determines the composition and quality of resulting datasets, thus the resulting model, yet iterating on these strategies is prohibitively difficult and expensive. The current data generation workflows also prevent adoption of sophisticated techniques like score-based negative sampling that requires real-time computation during training. Downstream engagement labels present a similarly complex challenge. Unlike immediate action labels, these labels focus on driving long-term user engagement rather than instant responses. The complexity increases because each label involves multiple tunable hyperparameters (e.g. engagement decay) creating a vast search space. In both cases, teams would ideally conduct hyperparameter tuning to systematically explore these vast search spaces and identify optimal configurations, but the current data pipeline architecture makes such comprehensive exploration prohibitively expensive and time-consuming.
To address these limitations, we shifted both use-cases from static dataset generation to a streaming paradigm built on Ray that enables truly iterative experimentation, moving sampling and label-generation logic directly into the training dataloader to process data in real time. This eliminates the costly choice between expensive backfills and weeks of waiting, while enabling comprehensive hyperparameter exploration. The impact spans both domains: sampling changes on ranking models now ship with 10x faster development time, while downstream engagement label experimentation has been reduced from 6 weeks to 3 days and adopted by multiple teams. The solution's power is fully realized during productization, where teams must simultaneously optimize both label-generation parameters and sampling strategies: our unified approach handles both seamlessly within the same pipeline. Combined with Ray's bucket-join capabilities, which enable joining large embedding features and multiday datasets previously impossible due to cost and compute constraints, this has saved hundreds of thousands of dollars while transforming dataset iteration from a fundamental bottleneck into an enabler of rapid experimentation.
03:15 PM - 03:45 PM | Golden Gate C2
vLLM
Elastic Expert Parallelism for vLLM
Large-scale Expert Parallelism (EP) is a key enabler for efficient inference of Mixture-of-Experts (MoE) models. However, the inter-instance autoscaling mechanisms commonly used for LLM serving struggle with the monolithic deployment units EP requires; DeepSeek R1/V3, for example, uses 144 GPUs as a basic scaling unit. In this session, we will share how intra-instance Elastic EP gives vLLM low-latency, minimal-downtime, fine-grained autoscaling for serving MoE models, enabling a tight match between workload and serving resources. We leverage Ray to orchestrate elastic EP scaling.
03:15 PM - 03:45 PM | Golden Gate C3
Machine Learning
Media & Gaming
Structured Data
Mako: Netflix's Next Generation ML Training Platform
At Netflix, we are building Mako, a new ML training platform designed to meet the demands of modern AI workloads. In this talk, we will share how we evolved our training platform, improved GPU efficiency using a custom scheduler, and made key architecture changes to support large-scale training. We will also cover how Ray fits into this journey and what we learned along the way.
03:15 PM - 03:45 PM | Yerba Buena 2-3
vLLM
CoServe: Max performance, minimal compute
Cohere is committed to building a scalable, efficient, all-in-one platform for private and secure enterprise AI. On top of the vLLM library, we combine accuracy-preserving low-bit quantization with extensive kernel and data-communication optimization to deliver high-performance, low-latency inference at minimal compute cost. For example, our foundation model Command A series can be served on a single H100 GPU at low latency while supporting more than 128K context length.
03:15 PM - 03:45 PM | Yerba Buena Salon 4-6
Ray Deep Dives
Ray Data: Data Processing for AI workloads
Ray Data is one of the most popular libraries in the Ray ecosystem. Unlike other data processing engines, Ray Data is built for emerging AI workloads that are multimodal, accelerator-native, and AI-centric. In this talk, we'll overview Ray Data's key capabilities and the core features we've added to support large-scale batch inference, distributed training preparation and ingest, and multimodal data processing.
03:15 PM - 03:45 PM | Yerba Buena Salon 10-12
Ray Deep Dives
Ray Direct Transport: RDMA Support in Ray Core
GPU workloads on Ray often hit a hidden bottleneck: every tensor passed between tasks takes a costly trip through CPU memory and serialization to Ray's object store. Ray Direct Transport (RDT) is a new feature in Ray Core that eliminates this overhead by keeping GPU data on the device and transferring directly between actors via RDMA—no unnecessary copies, no serialization.
Powered by high-performance backends like NCCL, Gloo, and RDMA, RDT enables easy and efficient scaling of cutting-edge workloads like reinforcement learning for LLMs and disaggregated multimodal training. In this talk, we’ll show how RDT integrates seamlessly with the familiar Ray ObjectRef API, the architecture behind RDT, and demonstrate how it unlocks fast and flexible distributed GPU programming.
03:45 PM - 04:00 PM | Lightning Theater
Lightning Talk
Text / Docs
Structured Data
Parallelizing Searches over Agentic Pipelines with Ray and syftr
Agentic pipelines are central to applying large language models (LLMs) to proprietary or dynamic data. However, constructing efficient agentic flows presents significant challenges. It necessitates precise selection among various components, including vector databases, embedding models, text splitters, retrievers, and synthesizing LLMs. Further complicating this process is the meticulous tuning required for modules such as verifiers, rewriters, and rerankers, each with their own intricate hyperparameter dependencies. In performance-sensitive applications, manually balancing the tradeoffs between latency, accuracy, and cost becomes progressively more difficult.
We introduce syftr, a framework that performs efficient, distributed, multi-objective search over a vast (10²³) space of agentic and non-agentic configurations. Using advances in Bayesian optimization, syftr discovers Pareto-optimal flows that jointly optimize task accuracy and cost. A novel early-stopping mechanism further improves efficiency by pruning clearly suboptimal candidates. Across multiple benchmarks, syftr finds flows that are on average ≈9× cheaper while preserving most of the accuracy of the most accurate flows on the Pareto frontier. Furthermore, syftr’s ability to design and optimize also allows easy integration of new modules, making it even easier and faster to realize high-performing generative AI pipelines.
Building syftr is especially challenging from an infrastructure point of view. For example, one of the most compute-intensive parts of the search process is vector database (VDB) construction. Because syftr tries out multiple embedding models, chunk sizes, and so on, VDB construction for large datasets forms a large part of the search compute. Small embedding models can run on CPUs (cheap and plentiful) while larger ones require GPUs (expensive and scarce). syftr uses Ray to distribute this workload across heterogeneous compute clusters of CPUs and different GPU SKUs (T4, A100, H100, etc.). When we self-host OSS models in the search space, syftr creates inference load hotspots as the optimizer homes in on the few LLMs and embedding models that appear in flows on the Pareto frontier. Ray Serve provides a way to autoscale high-demand models while scaling cold models to zero.
In this talk we go deep into how Ray’s unique abilities of scale, robustness and ease-of-use accelerates research like syftr at the intersection of AI and AI infrastructure.
Paper: https://arxiv.org/abs/2505.20266 (AutoML 2025)
Code: https://github.com/datarobot/syftr
Blog: https://www.datarobot.com/blog/pareto-optimized-ai-workflows-syftr
Read more
Agentic pipelines are central to applying large language models (LLMs) to proprietary or dynamic data. However, constructing efficient agentic flows presents significant challenges. It necessitates precise selection among various components, including vector databases, embedding models, text splitters, retrievers, and synthesizing LLMs. Further complicating this process is the meticulous tuning required for modules such as verifiers, rewriters, and rerankers, each with their own intricate hyperparameter dependencies. In performance-sensitive applications, manually balancing the tradeoffs between latency, accuracy, and cost becomes progressively more difficult.
We introduce syftr, a framework that performs efficient, distributed, multi-objective search over a vast (~10^23) space of agentic and non-agentic configurations. Using advances in Bayesian Optimization, syftr discovers Pareto-optimal flows that jointly optimize task accuracy and cost. A novel early-stopping mechanism further improves efficiency by pruning clearly suboptimal candidates. Across multiple benchmarks, syftr finds flows that are on average ≈9× cheaper while preserving most of the accuracy of the most accurate flows on the Pareto frontier. Furthermore, syftr’s modular design allows easy integration of new components, making it even easier and faster to realize high-performing generative AI pipelines.
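The Pareto-optimality criterion behind this search can be sketched in a few lines. The flow names, accuracies, and costs below are hypothetical, chosen only to mirror the "≈9× cheaper at near-top accuracy" pattern the abstract describes:

```python
def pareto_frontier(flows):
    """Return the names of flows not dominated on (accuracy, cost).

    One flow dominates another if it is at least as accurate and no more
    expensive, and strictly better on at least one of the two objectives.
    """
    frontier = []
    for name, acc, cost in flows:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for _, a, c in flows
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical flows: (name, task accuracy, relative cost per query)
flows = [
    ("large-model-rag", 0.92, 10.0),  # most accurate, 10x the cost
    ("small-model-rag", 0.88, 1.1),   # ~9x cheaper, most of the accuracy
    ("no-retrieval",    0.60, 0.5),
    ("overpriced",      0.55, 2.0),   # worse and costlier: dominated
]
print(pareto_frontier(flows))
```

The optimizer's job is then to populate this frontier efficiently, pruning candidates (like "overpriced" above) that are dominated on both objectives.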
Building syftr is especially challenging from an infrastructure point of view. For example, one of the most compute-intensive parts of the search process is vector database (VDB) construction. Because syftr tries many embedding models, chunk sizes, and other settings, VDB construction for large datasets accounts for a large share of the search compute. Small embedding models can run on CPUs (cheap and plentiful), while larger ones require GPUs (expensive and scarce). syftr uses Ray to distribute this workload across heterogeneous compute clusters of CPUs and different GPU SKUs (T4, A100, H100, etc.). When we self-host OSS models in the search space, syftr creates inference load hotspots as the optimizer homes in on the few LLMs and embedding models that appear in flows on the Pareto frontier. Ray Serve provides a way to autoscale high-demand models while scaling cold models to zero.
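The CPU-versus-GPU routing decision described above amounts to picking a resource request per VDB-construction task. The parameter-count threshold and SKU labels here are illustrative assumptions, not syftr's actual configuration:

```python
def task_resources(embedder_params_m, gpu_sku="A100"):
    """Pick a Ray-style resource request for a VDB-construction task.

    Small embedding models go to cheap, plentiful CPUs; larger ones are
    routed to a GPU of a specific SKU via a custom resource label.
    """
    if embedder_params_m <= 500:          # model size in millions of params
        return {"num_cpus": 4}
    return {"num_gpus": 1, "resources": {gpu_sku: 1}}


print(task_resources(110))             # a MiniLM-class embedder -> CPUs
print(task_resources(7000, "H100"))    # a 7B embedder -> one H100
```

In Ray, a dict like this would be passed as remote-task options, letting the scheduler place each task on whichever node in the heterogeneous cluster satisfies it.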
In this talk, we go deep into how Ray’s scale, robustness, and ease of use accelerate research like syftr at the intersection of AI and AI infrastructure.
Paper: https://arxiv.org/abs/2505.20266 (AutoML 2025)
Code: https://github.com/datarobot/syftr
Blog: https://www.datarobot.com/blog/pareto-optimized-ai-workflows-syftr
04:00 PM - 05:00 PMGift Room
Gift Redemption Scan
04:00 PM - 04:30 PMGolden Gate A
Machine Learning
Text / Docs
Image
Structured Data
Building User Centric Foundation Models with Ray
At Grab, Southeast Asia's leading super app, a single user journey is a rich, multi-modal story. To understand our users holistically, we needed to learn from the complex web of their interactions across our diverse services. Our goal was to build a powerful user embedding foundation model that could capture this complete view, enhancing numerous downstream models and personalizing the user experience.
The core output of this model is a set of powerful, general-purpose user embeddings. These numerical representations act as a universal feature set, designed to fuel a wide array of downstream applications and eliminate the need for siloed, hand-engineered features for each task.
However, off-the-shelf models could not comprehend our unique data ecosystem—a complex blend of long-term tabular profiles and short-term sequential interactions. This forced us to develop a custom transformer architecture that unifies these diverse data types using a novel key-value tokenization strategy and modality-specific adapters.
Training this model on terabytes of data and generating millions of embeddings daily presented a significant scaling challenge, especially for a small team. We overcame this by leveraging Ray to build an efficient, distributed computing pipeline.
Today, our pre-trained embeddings are a critical component for systems across the company. They are actively powering a wide range of applications, including Churn Prediction, Ads Optimization, and Dual App Detection, creating millions of fresh embeddings daily for our users, merchants, and drivers.
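The key-value tokenization idea mentioned above can be sketched as a toy scheme that flattens both modalities into one token stream. The field names and token format are hypothetical, not Grab's actual tokenizer or adapters:

```python
def kv_tokenize(profile, events):
    """Unify a long-term tabular profile and a short-term event sequence
    into one token stream using key=value tokens (illustrative only)."""
    # Tabular features become deterministic key=value tokens.
    tokens = [f"<tab>{k}={v}" for k, v in sorted(profile.items())]
    # Sequential interactions keep their order as action:service tokens.
    tokens += [f"<seq>{e['action']}:{e['service']}" for e in events]
    return tokens


profile = {"tenure_days": 420, "city": "Jakarta"}
events = [{"action": "order", "service": "food"},
          {"action": "book", "service": "ride"}]
print(kv_tokenize(profile, events))
```

A transformer with modality-specific adapters would then embed the `<tab>` and `<seq>` token families differently before attending over the combined sequence.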
04:00 PM - 04:30 PMGolden Gate B
vLLM
Engineering Lessons from Scaling Synthetic Data for Trillion-Scale Pretraining with KubeRay + vLLM
As language models continue to grow in capability and scale, high-quality training data has become a critical bottleneck. Synthetic data generation has emerged as a core technique for creating diverse, targeted datasets that complement organic sources and power models from lightning-fast 4.5B-parameter systems to frontier models like GPT-5.
We will share engineering lessons from building and scaling a production platform that processes trillions of tokens using KubeRay and vLLM, dynamically orchestrating thousands of GPU workers across multimodal recaptioning, rephrasing, and domain-specific content generation tasks. Topics include pushing vLLM inference to near-peak GPU utilization, designing fault-tolerant Ray actors for tensor-parallel sharding, auto-scaling KubeRay clusters to match workload patterns, and applying storage and scheduling strategies that deliver both high performance and significant cost efficiency.
We will highlight practical patterns for building resilient, scalable ML infrastructure. Attendees will learn how clean abstractions between Ray's distributed computing layer and vLLM's inference engine enable rapid iteration on prompt engineering while maintaining production stability, accelerating the journey from research prototypes to trillion-token datasets that define the next generation of AI capabilities.
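The fault-tolerance contract described above, where a shard is retried when a worker dies mid-batch, can be sketched in-process. The `flaky_caption` worker and shard names are made up for illustration; in production the worker would be a Ray actor holding a vLLM engine:

```python
def process_shards(shards, worker, max_attempts=3):
    """Run each shard through a possibly-flaky worker with bounded retries,
    collecting results and shards that exhausted their attempts."""
    results, failures = [], []
    for shard in shards:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(worker(shard))
                break
            except RuntimeError:
                if attempt == max_attempts:
                    failures.append(shard)
    return results, failures


calls = {"n": 0}

def flaky_caption(shard):
    calls["n"] += 1
    if calls["n"] == 1:              # first call simulates a lost GPU worker
        raise RuntimeError("worker lost")
    return f"captioned:{shard}"

results, failures = process_shards(["s0", "s1"], flaky_caption)
print(results, failures)
```

With Ray actors, the retry would additionally restart the actor (and reload the model) rather than just re-invoking a function, but the control flow is the same.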
04:00 PM - 04:30 PMGolden Gate C1
Machine Learning
Structured Data
Exabyte-scale Streaming Iceberg IO with Ray, Flink, and DeltaCAT
A production case study on how Amazon uses Ray and DeltaCAT at exabyte scale to resolve longstanding performance and scale challenges in integrating streaming pipelines with Apache Iceberg. The talk also highlights how the Apache Flink, Ray, Apache Beam, and Apache Spark communities can start bringing the same benefits to their workloads using DeltaCAT's Iceberg table-management jobs on Ray together with Flink and Beam.
04:00 PM - 04:30 PMGolden Gate C3
LLMs
Matrix: reliable framework for data-centric experimentation at scale
Scaled, high-quality data is the oil driving progress toward AGI in research and development. Thanks to foundational works such as Ray, Slurm, and vLLM, it has become much easier to manage compute resources at scale and access a diverse set of SOTA LLMs. However, these tools are often designed for experienced engineers, creating entry barriers that keep researchers from unleashing their full potential. Thus, in the Fundamental AI Research (FAIR) lab at Meta, we built Matrix, a reliable framework for data-centric experimentation at scale, to connect these foundational pieces so researchers can quickly iterate on their ideas and run experiments with large-scale models and data.
Matrix supports robust, auto-scaled data generation from LLMs, game engines, and physics or world-model simulators with one command. It also offers easy setup for scalable data processing and augmentation, such as batched LLM-as-a-judge, safe code execution for verification, and data deduplication, classification, and clustering. The framework also offers efficient and reproducible evaluation pipelines for large teams to collaborate on.
Matrix is widely used to empower Meta’s research and production bets in AGI including MLLMs and world modeling. In this session, we will introduce the Matrix framework from its design and synergy with other industry initiatives like Ray and vLLM, to the research and production use cases Matrix enables. We will also provide a short tutorial for developers to join the area.
04:00 PM - 04:30 PMYerba Buena 2-3
LLMs
Accelerating Large-Scale AI Deployment with NVIDIA Dynamo
The explosive growth of Large Language Models (LLMs) requires massively efficient and scalable inference systems. This talk will share key innovations NVIDIA Dynamo adds to enable system-level optimizations while leveraging performance from inference engines such as vLLM, SGLang, and TRT-LLM:
- Smart Scheduling that routes requests based on KV cache hit rate and load, intelligently autoscales, and disaggregates the prefill and decode phases.
- Hierarchical Memory Management that utilizes HBM, host memory, local disk, and remote storage.
- Low-Latency Transfer of the KV cache across nodes and the memory hierarchy.
This talk will also introduce production-grade LLM serving features of Dynamo that enable users to:
- Find the best configuration for disaggregated serving offline.
- Tune performance automatically based on real-time traffic.
- Dynamically scale prefill and decode workers via topology-aware gang scheduling.
- Leverage LLM-specific fault tolerance.
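KV-cache-aware routing of the kind listed above can be sketched as a scoring function over decode workers. The block-overlap score and load penalty below are illustrative assumptions, not Dynamo's actual policy:

```python
def route(prompt_blocks, workers, load_penalty=0.5):
    """Route a request to the worker with the best tradeoff between
    estimated KV-cache overlap and current load."""
    def score(w):
        # Blocks of the prompt already resident in this worker's KV cache
        # are prefill work we can skip; in-flight requests add latency.
        overlap = len(prompt_blocks & w["cached_blocks"])
        return overlap - load_penalty * w["inflight"]
    return max(workers, key=score)["name"]


workers = [
    {"name": "w0", "cached_blocks": {1, 2, 3}, "inflight": 4},  # warm but busy
    {"name": "w1", "cached_blocks": {1, 2},    "inflight": 0},  # cooler but idle
]
print(route({1, 2, 3, 4}, workers))
```

The interesting production problems are in estimating the overlap cheaply and keeping the cache maps fresh as entries are evicted across the memory hierarchy.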
04:00 PM - 04:30 PMYerba Buena Salon 4-6
Ray Deep Dives
Structured Data Processing in Ray Data
Ray Data is a data processing engine purpose-built for ML/AI workloads. In the past year, we've invested heavily in adding support for more traditional tabular data processing to Ray Data. In this talk, we'll cover key new features like shuffles, joins, aggregations, and columnar expressions; the architectural changes needed to support them; and the performance improvements we see across Ray versions from these changes. We'll end with a discussion of the future roadmap and case studies of Ray users successfully deploying Ray Data in production.
04:00 PM - 04:30 PMYerba Buena Salon 10-12
Reinforcement Learning
SkyRL tx: A unified training and inference engine
SkyRL tx is an open-source implementation of Thinking Machines' Tinker API which unifies transformer training and inference into a single REST-based interface. SkyRL tx implements the post-training system as an inference engine that also supports backward passes, and therefore eliminates the complexity of maintaining separate training and inference stacks. The system leverages LoRA for cost-effective multi-tenancy, allowing many users to share a base model with their own efficient adapters.
This talk will cover SkyRL tx's architecture and implementation and some of the design decisions we are making, as well as the project's roadmap and opportunities for community contribution. SkyRL tx targets researchers and developers who want to understand and extend the implementation for their own use cases, as well as organizations that want to run their own Tinker-compatible backend on their own hardware.
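The LoRA path that enables this multi-tenancy is a low-rank correction added to the base model's output, so many tenants can share one base weight matrix. A toy dense-math sketch with illustrative shapes (no framework):

```python
def lora_apply(W, A, B, x, alpha=1.0):
    """Compute y = W x + alpha * B (A x).

    W is the shared base weight; A (r x d) and B (d_out x r) form one
    tenant's low-rank adapter, so only r*(d + d_out) extra parameters
    are stored per tenant.
    """
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    base = matvec(W, x)                 # shared compute across tenants
    delta = matvec(B, matvec(A, x))     # tenant-specific low-rank path
    return [b + alpha * d for b, d in zip(base, delta)]


# Rank-1 adapter on a 2x2 identity base weight:
print(lora_apply([[1, 0], [0, 1]], [[1, 1]], [[1], [1]], [1, 2]))
```

Batching the adapter path across tenants is what lets a single engine serve many fine-tunes at close to base-model cost.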
04:30 PM - 04:45 PMLightning Theater
vLLM
Text / Docs
vLLM with the Transformers backend: One model definition to rule them all
What if you could use the same model implementation for training and inference?
With the Transformers backend for vLLM, that’s now a reality. vLLM can run any Transformers-compatible model (merged into Transformers or custom!) directly from its original definition at vLLM speeds! Your modeling code becomes the single source of truth, while vLLM handles high-performance inference, attention optimizations, and scaling.
In this session we’ll take a whirlwind tour of: what the Transformers backend is, how to make a model compatible, and how this enables teams to use the same modeling codebase for research and production.
04:45 PM - 05:00 PMLightning Theater
vLLM
Finance
Text / Docs
Leveraging Ray, vLLM, and LiteLLM to Build a Trusted LLM Service for Sensitive Data
In H1 2025, the Coinbase MLP team built a trusted LLM service by leveraging Ray, vLLM, and LiteLLM, reinforcing Coinbase's standing as the most trusted crypto exchange.
In this talk, we will go through the technical details of user authentication, service-to-service (s2s) auth, LiteLLM distribution, vLLM, and Ray to share the whole story of how Coinbase uses Ray and vLLM to build its LLM serving API and support internal LLM traffic.
05:00 PM - 05:15 PMLightning Theater
Machine Learning
Taming Distributed AI Training with Ray and Datadog Observability
Training large language models and other AI systems often means orchestrating thousands of tasks across clusters of GPUs. That’s where Ray shines—but at scale, things get messy. Jobs stall, GPUs sit idle, or workloads crawl without obvious reasons.
At Datadog, we run Ray internally to power LLM and AI training, and we’ve built observability practices to keep those jobs running fast and reliably. In this talk, we’ll share the real-world challenges of monitoring distributed Ray clusters—and the techniques we use to solve them.
You’ll learn:
- What goes wrong when running Ray across large, multi-GPU clusters.
- The signals (metrics, traces, logs) that actually help debug slow or failing jobs.
- How observability turns “black-box” AI training into something you can reason about and improve.
Whether you’re training your first model on Ray or running production-scale AI jobs, you’ll walk away with practical strategies for making distributed AI workloads more reliable, explainable, and performant.
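One of the simplest such signals, a GPU whose utilization stays low for several consecutive samples, can be sketched as a detector. The threshold, window, and sample data below are hypothetical, not Datadog's actual monitors:

```python
def idle_gpus(samples, threshold=0.10, window=3):
    """Flag GPUs whose utilization stays below `threshold` for `window`
    consecutive samples -- a basic 'GPU sitting idle' signal."""
    flagged = []
    for gpu, utils in samples.items():
        run = 0
        for u in utils:
            run = run + 1 if u < threshold else 0
            if run >= window:
                flagged.append(gpu)
                break
    return flagged


samples = {
    "gpu0": [0.85, 0.90, 0.88, 0.91],   # healthy training worker
    "gpu1": [0.40, 0.05, 0.02, 0.01],   # stalled after its task hung
}
print(idle_gpus(samples))
```

In practice the utilization series would come from per-node metrics, and the alert would be correlated with Ray task traces to find which actor stopped feeding the GPU.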
05:00 PM - 07:00 PMRayground
Ray Summit Celebration Happy Hour
Join colleagues and friends in the Rayground for demos, sponsors and Ray conversations.
05:15 PM - 05:30 PMLightning Theater
Machine Learning
What Production Ray Really Requires: A Guide to Operators, Observability, and Infrastructure
Scaling Ray often introduces layered challenges that burn out your team. First, there's the operational toil: your platform engineers are stuck manually managing and updating the Ray operator, a fragile process that slows everyone down. To solve this, we'll show how the KubeRay GKE addon - a managed, auto-updating component - eliminates this burden entirely.
Next, there's the diagnostic black hole. When a job fails, your teams are stuck facing cascading failures: is it a Ray application error or a GKE infrastructure issue? To solve this, we'll introduce the RayJob observability dashboard. It provides a single, unified view within Google Cloud Logging and Monitoring, putting your Ray-level logs and metrics in the same place as GKE pod and cluster events, so you can diagnose the root cause faster.
Then, you hit performance cliffs at scale. To stop the guesswork, we'll show how GCP’s purpose-built infrastructure, like Titanium ML networking for inter-node communication, Hyperdisk ML for data I/O and secondary boot disk for faster image loading prevents the hidden bottlenecks that cripple large jobs.
Finally, for teams who want to run Anyscale's commercial product on their own infrastructure, we’ll introduce RayTurbo Standalone. It’s a drop-in replacement for open-source Ray, allowing you to leverage the performance of RayTurbo directly within GCP environments like internal ML Platforms.
You’ll leave understanding how an integrated platform solves these distinct problems so you can scale with confidence.
05:30 PM - 05:45 PMLightning Theater
LLMs
Streamlining Production LLM Inference with EKS Auto Mode and Ray Serve
Running LLM inference at scale shouldn’t require a PhD in Kubernetes operations. This talk demonstrates how EKS Auto Mode eliminates operational overhead while Ray Serve handles the complexity of serving large language models—letting your team focus on delivering AI value, not managing infrastructure.
We’ll showcase a real-world deployment transformation: moving from manual cluster management nightmares to a fully automated, production-ready LLM serving platform. You’ll see how EKS Auto Mode’s intelligent node provisioning, automatic scaling, and built-in observability specifically address AI/ML workload demands—including GPU management, burst capacity handling, and cost optimization for expensive inference hardware. Walk away with a blueprint for deploying cost-efficient, self-healing LLM inference infrastructure that scales from prototype to production without operational complexity. Perfect for ML engineers tired of wrestling with Kubernetes and platform teams seeking turnkey AI infrastructure solutions.
05:45 PM - 06:00 PMLightning Theater
vLLM
Building the Future of Inference: DigitalOcean’s Journey with Ray, vLLM, and Beyond.
As generative models grow in size, context length, and modality, the challenge of delivering reliable and efficient inference at scale becomes increasingly complex. This talk presents how our team built a robust inference platform powered by Ray and vLLM running on Kubernetes over GPUs, enabling both serverless and dedicated inference modes. We’ll dive into how Ray’s scheduling primitives, placement groups, and observability tools drive reliability and elasticity across workloads, while vLLM ensures efficient token streaming and memory management.
We’ll explore serverless inference for dynamic scaling and dedicated inference for optimized GPU partitioning and quantization. Then we’ll discuss ongoing inference optimization initiatives addressing degraded accuracy and performance for long-context models (>8k tokens), including dynamic batching by token length, KV cache reuse, and speculative decoding.
Finally, we’ll outline our multimodal and multi-tenant roadmap, focusing on concurrent model orchestration, isolation, and security-aware billing, culminating in a vision for a centralized orchestration layer using Ray as the control plane and a unified model registry for intelligent model placement and prioritization.
Target audience: this session is designed for AI infra experts building scalable, reliable, and future-ready AI inference systems. It is also suited for those looking to get started on building an inference serving stack.
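One optimization the abstract mentions, dynamic batching by token length, can be sketched in a few lines. This is a simplified illustration (naive whitespace tokenization, a made-up token budget), not the actual serving stack: prompts of similar length are grouped into batches whose total token count stays under a budget, so long-context requests don't pad out batches of short ones.

```python
# Simplified sketch of dynamic batching by token length. Token counts here
# are naive whitespace splits; a real serving stack would use the model's
# tokenizer and tune the budget to the GPU's memory and latency targets.

def batch_by_token_length(prompts, max_tokens_per_batch=64):
    batches, current, current_tokens = [], [], 0
    # Sort by length so similarly sized prompts share a batch,
    # reducing padding waste on the GPU.
    for prompt in sorted(prompts, key=lambda p: len(p.split())):
        n = len(prompt.split())
        if current and current_tokens + n > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(prompt)
        current_tokens += n
    if current:
        batches.append(current)
    return batches
```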
06:00 PM - 06:15 PMLightning Theater
Lightning Talk
Text / Docs
Structured Data
Synthetic data generation with ray data + serve + vLLM
This talk covers design patterns and considerations for combining ray data + serve + vLLM to construct scalable, high-throughput pipelines for synthetic data generation. As an illustrative example, we implement a two-agent self-refinement loop using ray.serve + vLLM and integrate it into a ray.data pipeline.
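The two-agent self-refinement loop the abstract describes can be sketched with plain functions standing in for the deployments. Everything below is a hypothetical stub: in the real pipeline, generate(), critique(), and refine() would each be backed by a Ray Serve deployment wrapping a vLLM engine, and the loop would run inside a ray.data transformation over a dataset of prompts.

```python
# Minimal sketch of a two-agent self-refinement loop. The generator drafts
# an answer, the critic returns feedback (or None when satisfied), and the
# generator revises until the critic approves or the round budget runs out.

def generate(prompt):
    # Stand-in for the generator agent (a vLLM call in practice).
    return f"draft answer to: {prompt}"

def critique(answer):
    # Stand-in for the critic agent; returns feedback or None if satisfied.
    return "be more specific" if "draft" in answer else None

def refine(answer, feedback):
    # Stand-in for the generator revising its answer given the feedback.
    return answer.replace("draft", "refined")

def self_refine(prompt, max_rounds=3):
    answer = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(answer)
        if feedback is None:
            break
        answer = refine(answer, feedback)
    return answer
```

A bounded round count matters in a batch pipeline: without it, a single stubborn critic can stall an entire ray.data block.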
06:15 PM - 06:30 PMLightning Theater
Lightning Talk
Structured Data
Improved Scheduling Flexibility with Label Selectors in Ray
Acquiring scarce accelerator resources for Ray applications on a heterogeneous cluster can be challenging due to different accelerator type and topology requirements and limited availability. These issues previously required workarounds such as setting custom resources and accelerator_type.
Ray's new Label Selector API helps alleviate these challenges by enabling users to schedule tasks, actors, and placement groups using Ray node labels specified at RayCluster creation by the user and detected automatically by Ray. This API offers support for both static and auto-scaling RayClusters, fallback strategies, and per-bundle selectors, enabling users to make precise placement decisions at the application level. This functionality is incorporated in the Anyscale platform, Ray dashboard, and KubeRay. The same user code operates identically across platforms.
This talk will primarily explore common use cases, API modifications, and a live demo highlighting how the new label selector API enhances scheduling flexibility.
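The core idea of label-based scheduling can be modeled in a few lines. This is a toy model, not the actual Ray API: each node advertises key/value labels, a workload supplies a selector, and the scheduler keeps only the matching nodes, with a second selector acting as a fallback when the first matches nothing. The label keys and node names below are made up for illustration.

```python
# Toy model of label-selector scheduling with a fallback strategy.

def matching_nodes(nodes, selector):
    """Return names of nodes whose labels satisfy every key/value in selector."""
    return [
        name for name, labels in nodes.items()
        if all(labels.get(k) == v for k, v in selector.items())
    ]

# Hypothetical heterogeneous cluster: labels would be set at RayCluster
# creation or detected automatically in the real system.
nodes = {
    "node-a": {"accelerator-type": "A100", "zone": "us-east1-b"},
    "node-b": {"accelerator-type": "H100", "zone": "us-east1-b"},
    "node-c": {"accelerator-type": "H100", "zone": "us-west1-a"},
}

# Prefer in-zone H100s; fall back to any H100 if none match.
primary = matching_nodes(nodes, {"accelerator-type": "H100", "zone": "us-east1-b"})
fallback = primary or matching_nodes(nodes, {"accelerator-type": "H100"})
```

Compared with the custom-resources workaround, selectors express the placement constraint declaratively, so the same application code can run on any cluster whose nodes carry the labels.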
06:30 PM - 06:45 PMLightning Theater
Machine Learning
Enabling a Dynamic Data Plane for Ray with the VAST AI Operating System
Ray orchestrates the intelligent part of AI (compute) but assumes that persistent data lives in a far-away, unintelligent world. The VAST AI Operating System challenges that assumption by matching Ray's dynamic, resource-aware orchestration with a data plane that is equally parallel, expressive, and programmable. In this talk, we will describe how VAST AI OS not only supports essential data platform features such as exabyte namespaces and terabytes per second of bandwidth, but higher-order capabilities such as fine-grained snapshots and native eventing infrastructure that allow data to actively participate in the workflows and pipelines that drive compute orchestration.