Sessions
Explore the expanding lineup of confirmed sessions and speakers for Ray Summit, with more updates and an official agenda builder coming soon.
Building a Reliable, High-Velocity Model Serving Platform with Ray Serve
In the rapidly evolving landscape of machine learning deployments, organizations face dual challenges: maintaining high deployment velocity while ensuring system reliability. We present a novel approach to ML model serving infrastructure using Ray Serve, demonstrating significant improvements in both deployment speed and system reliability.
We developed a self-service platform that enhances the traditional model deployment pipeline through dynamic Ray Serve APIs. Our architecture implements dedicated cluster isolation for different use cases, ensuring reliability while maintaining rapid deployment capabilities. The platform incorporates automated reliability checks within the self-serve deployment process, enabling engineers to deploy models independently while maintaining system integrity.
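To make the self-serve pattern concrete, here is a minimal, hedged sketch of what such a Ray Serve deployment might look like; the `load_model` helper, model URI, and route are invented for illustration, and the real platform layers its automated reliability checks on top of a structure like this.

```python
from ray import serve


def load_model(uri: str):
    """Stand-in for the platform's model loader (hypothetical)."""
    class _StubModel:
        def predict(self, features):
            return 0.0
    return _StubModel()


@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class ModelDeployment:
    def __init__(self, model_uri: str):
        self.model = load_model(model_uri)

    def check_health(self):
        # Ray Serve calls this periodically; raising marks the replica unhealthy.
        if self.model is None:
            raise RuntimeError("model failed to load")

    async def __call__(self, request):
        features = await request.json()
        return {"prediction": self.model.predict(features)}


app = ModelDeployment.bind("s3://models/example/v1")  # hypothetical model URI
# serve.run(app, name="example-model", route_prefix="/example")
```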
A key feature of our platform is the seamless integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) through Ray LLM APIs. This integration has substantially improved experimentation velocity, allowing teams to rapidly onboard and evaluate models. The platform also includes batch inference capabilities, providing cost-effective alternatives to API-based model providers.
Key Results and Takeaways:
1. Automated Self-serve Deployments: Reduced deployment time from days to minutes while maintaining reliability through built-in checks
2. Enhanced Reliability: Dedicated clusters ensuring performance isolation across different ML applications
3. Accelerated LLM Integration: Streamlined onboarding process for rapid experimentation with LLM/MLLM models
4. Scalable Architecture: Unified framework supporting both traditional ML and LLM workloads in production
Scaling Image and Video Processing with Ray
End-to-End Hybrid Reinforcement and Imitation Learning for Robotics with Ray
Machine learning in robotics demands complex abstractions over hardware and training/simulation layers in order to combine RL and IL. As a result, policy learning for robotics rarely fits on one kind of machine: massive simulation parallelization with GPU physics and rendering in Isaac Lab demands RTX-class GPUs, while policy training benefits from large VRAM and FLOPs. Over the past year we have built our infrastructure on Ray, which hides this hardware/software diversity and lets researchers focus on science, not sysadmin work.
Our platform offers:
Unified orchestration – a single Ray workflow trains full-state RL models, uses them to train a multi-task IL policy, and evaluates the result in simulation.
Heterogeneous GPU scheduling – placement groups assign Isaac Lab simulators to RTX workers and gradient computation to A100/H100 trainers without manual mapping (see the sketch after this list).
Isolated deployment targets – the same job definition that trains a policy can package it into a lightweight Ray Serve micro‑service that runs next to the robot or on a nearby edge server, shielding control code from research churn.
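As a rough illustration of the scheduling pattern above, the sketch below uses Ray placement groups with custom resource labels; the resource names (`rtx_gpu`, `trainer_gpu`) and the task bodies are hypothetical stand-ins, not our actual job definitions, and assume a cluster where those resources are registered on the corresponding node types.

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# One bundle on a simulator node, one on a trainer node. The custom resources
# "rtx_gpu" and "trainer_gpu" are assumed to be registered on those node types.
pg = placement_group(
    bundles=[{"CPU": 8, "rtx_gpu": 1}, {"CPU": 16, "trainer_gpu": 1}],
    strategy="SPREAD",
)
ray.get(pg.ready())


@ray.remote(num_cpus=8, resources={"rtx_gpu": 1})
def rollout(policy_weights):
    # Roll out the current policy in the simulator and return trajectories.
    return ["trajectory"]


@ray.remote(num_cpus=16, resources={"trainer_gpu": 1})
def train_step(trajectories):
    # Compute gradients / update the policy on the large-VRAM trainer.
    return {"updated_weights": None}


strategy = PlacementGroupSchedulingStrategy(placement_group=pg)
trajs = rollout.options(scheduling_strategy=strategy).remote(None)
new_weights = train_step.options(scheduling_strategy=strategy).remote(trajs)
print(ray.get(new_weights))
```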
During the live demo we will:
Launch a hybrid RL‑IL run that automatically provisions both RTX and H100 nodes.
Watch Ray adapt the cluster as workloads shift from simulation to learning to evaluation.
Deploy the resulting policy to an isolated runtime on the robot—ready for immediate testing.
Attendees will leave with practical design patterns for juggling simulator‑heavy and large‑scale network training inside one reproducible Ray ecosystem, plus insights on meeting real‑time robotics constraints while remaining GPU‑efficient.
Operationalizing Ray for Real-Time Inference at Scale
The KubeRay project simplifies the deployment of Ray clusters on Kubernetes. However, hardening these clusters to meet stringent production non-functional requirements (NFRs) typical of regulated environments such as those in finance requires additional engineering effort. These NFRs include multi-region and multi-AZ resilience, deployments with zero downtime, proactive autoscaling, automated production validation testing, and robust integration with monitoring and logging systems. Achieving this requires strict adherence to integration patterns and architectural discipline. This work outlines the engineering patterns and platform enhancements we applied to deploy a real-time recommendation system on a Ray cluster on EKS in production.
The system architecture routes traffic through Route53 to an NLB, an ALB, and finally an nginx ingress to reach Ray Serve deployments running in isolated Kubernetes namespaces. Ray Serve’s proxy actors, distributed across worker pods, act as distributed ingress routers, balancing traffic across replicas, buffering during surges, and enabling concurrent processing without blocking. Inference workloads run on dedicated Ray clusters in separate Kubernetes namespaces managed by the centralized KubeRay Operator. Compute and network isolation is enforced by network policies and RBAC.
Autoscaling is handled at three levels: Ray Serve scales replicas based on request queue depth, KubeRay Autoscaler adjusts Ray worker pod counts based on cluster metrics, and the AWS Cluster Autoscaler provisions EC2 instances based on pending pods awaiting compute resources. This ensures responsiveness during traffic spikes while avoiding over-provisioning.
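A minimal sketch of the first level, Ray Serve replica autoscaling, is shown below; the thresholds are hypothetical and the exact option names vary slightly between Ray versions. KubeRay and the AWS Cluster Autoscaler then add Ray worker pods and EC2 nodes when the extra replicas cannot be placed.

```python
from ray import serve


@serve.deployment(
    max_ongoing_requests=10,
    autoscaling_config={
        "min_replicas": 2,
        "max_replicas": 20,
        "target_ongoing_requests": 5,  # hypothetical target, tuned per model
        "upscale_delay_s": 10,
        "downscale_delay_s": 300,
    },
)
class Recommender:
    async def __call__(self, request):
        payload = await request.json()
        return {"recommendations": []}  # stand-in for real model inference


app = Recommender.bind()
# serve.run(app, route_prefix="/recommend")
```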
To maximize GPU utilization and reduce latency, the platform leverages Ray Serve’s dynamic batching in combination with vLLM to perform batched LLM inference. This approach ensures high-throughput, low-latency processing, especially under variable request loads, by grouping requests at runtime based on traffic characteristics.
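The sketch below shows the core of that pattern with Ray Serve's `@serve.batch` decorator; the batching parameters and the stub model are illustrative only, and in the real system the batched call would be handed to a vLLM engine.

```python
from ray import serve


@serve.deployment(ray_actor_options={"num_gpus": 1})
class BatchedLLM:
    def __init__(self):
        # Stand-in for a vLLM engine; a stub keeps the sketch self-contained.
        self.generate_batch = lambda prompts: [p[::-1] for p in prompts]

    # Requests that arrive within the wait window are grouped into one call.
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def handle_batch(self, prompts: list[str]) -> list[str]:
        return self.generate_batch(prompts)

    async def __call__(self, request) -> str:
        prompt = (await request.json())["prompt"]
        # Each caller sends one prompt; Serve assembles the batch transparently.
        return await self.handle_batch(prompt)


app = BatchedLLM.bind()
# serve.run(app, route_prefix="/generate")
```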
Observability is achieved through an integrated Prometheus and Grafana stack. A PodMonitor scrapes metrics from Ray components, which are then ingested by Prometheus and visualized in Grafana for real-time analysis and alerting. In parallel, a Fluentd DaemonSet captures logs from Ray and application pods, forwarding them to AWS CloudWatch. These logs are then ingested into Splunk for centralized search, monitoring, and audit compliance. The apps are also monitored at cluster level by Dynatrace and Datadog to further enhance observability and monitoring capabilities.
To enable robust and disruption-free deployments, the platform uses a blue-green deployment pipeline built with Spinnaker. The pipeline includes progressive rollout stages, automated validation, manual approval gates, and rollback paths.
This robust system demonstrates a scalable, resilient, and observability-driven approach to deploying real-time inference and LLM workloads on Ray. Attendees will gain valuable insights into end-to-end development and deployment architecture, GPU workload optimization, and operationalization of Ray Serve in production environments.
Building a User-Friendly and Scalable Framework for Petabyte-Scale Multi-Modal Datasets on Ray
As LLM applications grow in complexity and scale, the demand for large-scale processing over multimodal datasets is rapidly increasing. Designing frameworks that are both user-friendly and capable of handling petabyte-scale multimodal data presents significant technical challenges.
Super App, Super Powers: How a Small Team Built Grab's Multi-Modal Foundation Model with Ray
At Grab, Southeast Asia's leading super app, a single user journey is a rich, multi-modal story. To understand our users holistically, we needed to learn from the complex web of their interactions across our diverse services. Our goal was to build a powerful user embedding foundation model that could capture this complete view, enhancing numerous downstream models and personalizing the user experience.
While off-the-shelf foundation models exist, they proved unsuitable. These generic models cannot comprehend the unique nature of a Grab user journey—a complex interplay of Grab-centric data like user_ids, specific locations, numerical data, and our merchants and drivers. To win, we had to build our own.
For a small team, this presented two significant hurdles: designing a novel model architecture to overcome these limitations, and tackling the immense infrastructure challenge. Unlike large corporations with dedicated engineering armies, we lacked the capacity to write and optimize large-scale distributed training code from scratch. The question was: how could we build a foundation model trained on terabytes of data to serve millions of users daily?
The solution for us was Ray. It delivered significant speedups to our entire workflow, enabling us to easily optimize pre-training on cost-effective heterogeneous clusters. Ray empowered our research scientists to focus on what they do best: rapidly prototyping architectures, testing improvements, and iterating on designs in pure Python.
In this talk, we will detail our novel transformer architecture, which unifies diverse data types using a unique key-value tokenization strategy and adapters to learn from multiple modalities. We will also highlight how Ray’s on-the-fly data preprocessing was critical for rapid development. Instead of running lengthy, separate preprocessing jobs, we could directly test new ideas—from tokenization changes to dataset filtering—within our training code, drastically accelerating our iteration cycles.
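As a hedged illustration of that workflow (not Grab's actual code), the sketch below applies filtering and a key-value-style tokenization lazily with Ray Data inside the training job; the dataset path and column names are hypothetical.

```python
import ray

ds = ray.data.read_parquet("s3://user-events/")  # hypothetical dataset path


def key_value_tokenize(row: dict) -> dict:
    # Emit "field=value" tokens so heterogeneous fields share one vocabulary.
    row["tokens"] = [f"{k}={v}" for k, v in row.items() if k != "user_id"]
    return row


# Filtering and tokenization run on the fly; changing them requires no
# separate preprocessing job, only a code change in the training script.
ds = ds.filter(lambda row: row.get("event_count", 0) >= 5).map(key_value_tokenize)

for batch in ds.iter_batches(batch_size=256):
    pass  # feed batch["tokens"] into the transformer training step
```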
Finally, we will share how our pre-trained model's embeddings for long- and short-term user memory are now powering a wide range of critical downstream applications, including Churn Prediction, Ads Optimization, and Fraud Detection, as we create millions of embeddings daily for our users, merchants, and drivers.
JIT-Embedding with Ray Serve: Accelerating Large-Scale GenAI Foundation Model Training in Adobe Firefly
This presentation introduces JIT-Embedding (Just-in-Time Embedding), a novel solution designed to accelerate the training of foundational Generative AI (GenAI) models, with a focus on image and video generation in Adobe Firefly. By decoupling the expensive embedding computation from model training, JIT-Embedding enables these processes to scale independently. Built on Ray Serve, our architecture includes a robust JIT Service and JIT Client, seamlessly integrated with our Model Hub and Dataloader. The experiment results demonstrate that this approach significantly improved scalability, enabled higher-resolution and larger-scale GenAI Foundation Model training, and achieved notable performance gains and cost reductions. It is one of the innovations contributing to the Firefly Video Model's public release.
JIT-Embedding addresses several key challenges in large-scale foundation diffusion model training:
1. Slow on-the-fly embedding computation during training (e.g., VAE, CLIP, and T5 embeddings).
2. Long turnaround time required for offline embedding pre-computation.
3. High cost associated with recomputing embeddings using either approach.
4. Severe GPU memory constraints when training large models or processing high-resolution images/videos.
Our solution introduces several innovations to mitigate these issues:
1. JIT Service via Ray Serve: Wraps embedding computation as an on-demand service, deployable on underutilized lower-tier GPUs (e.g., A100), freeing high-end GPUs (H100) for model training and optimizing resource allocation. GPU memory requirements drop significantly on both sides.
2. JIT Client with Dataloader Integration: Uses multiprocessing and prefetching to overlap embedding requests with training, effectively hiding the latency of on-the-fly embedding computation and maximizing GPU utilization (see the sketch after this list).
3. Efficient Serialization/Deserialization: We created a Rust + Python library, inspired by functional programming, that efficiently compresses multimodal data (e.g., images, videos, long text) to improve server–client communication throughput and flexibility.
4. Advanced Performance Optimization: Combines Ray Serve’s dashboards with our custom metrics, profiling, and load-testing tools. We leverage advanced Ray features such as autoscaling, dynamic batching, and in-place model updates. Key optimizations include client-side load balancing, faster video/image codecs in Rust, overlapping CPU/GPU ops, and shared GPU usage across multiple models.
5. JIT Cache: Automatically stores computed embeddings for reuse across future training jobs, further reducing cost and computation time.
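The sketch below, referenced from item 2, shows one hedged way to overlap embedding requests with training using a Ray Serve deployment handle and a small prefetch queue; the service body, prefetch depth, and wrapper name are assumptions, not the actual JIT Client implementation.

```python
import queue
import threading

from ray import serve


@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class EmbeddingService:
    def __call__(self, samples: list) -> list:
        # Stand-in for VAE/CLIP/T5 embedding computation.
        return [len(str(s)) for s in samples]


handle = serve.run(EmbeddingService.bind())  # returns a deployment handle


def jit_dataloader(raw_batches, prefetch: int = 4):
    """Keep `prefetch` embedding requests in flight while training consumes results."""
    pending: queue.Queue = queue.Queue(maxsize=prefetch)

    def producer():
        for batch in raw_batches:
            pending.put((batch, handle.remote(batch)))  # non-blocking request
        pending.put(None)

    threading.Thread(target=producer, daemon=True).start()
    while (item := pending.get()) is not None:
        batch, response = item
        yield batch, response.result()  # blocks only if the embedding isn't ready yet


for batch, embeddings in jit_dataloader([["img-0"], ["img-1"], ["img-2"]]):
    pass  # run the diffusion-model training step with precomputed embeddings
```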
We plan to open source the JIT-Embedding solution, including the services, clients, and the serialization/deserialization library.
This talk will provide a comprehensive overview of the JIT-Embedding architecture, including the design of the JIT Service, JIT Client, Serialization/Deserialization, and the caching mechanism. We will present end-to-end experimental results from large-scale model training, showcasing the system’s scalability, performance enhancements, and cost efficiency. The session will conclude with key takeaways from our journey with Ray Serve and future directions for continued optimization.
Scaling training and inference for Autonomous Vehicles at Applied Intuition
Applied Intuition has adopted Ray to scale up GPU-accelerated workloads operating on petabytes of raw sensor data. Below are the three use cases for Ray within Applied.
1. Supervised learning with Ray Train
Our ML platform allows users to run hyperparameter sweeps and distributed training; Ray Train is the backbone of all of this.
2. Data processing with Ray Data for inference and training
Last-mile transformation as part of efficient data loading for sensor data. Samples are large (30-100 MB) and require CPU-intensive transformations. The platform scales up to hundreds of CPU cores per job and provides a generalized interface to stream samples from our data lake (see the sketch after this list).
3. Open- and closed-loop RL with RLlib
We run a mixture of open-loop and closed-loop training for our RL use case. We needed to build a custom algorithm on top of the new RLlib API stack that colocates GPU processes, pipelines environment rollout operations, and enables the hybrid (open/closed-loop) training regime.
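Below is a hedged sketch of the last-mile loading pattern from use case 2: CPU-heavy decoding runs in parallel with Ray Data and streams batches to the consumer; the dataset path, column name, and decode stub are hypothetical.

```python
import numpy as np
import ray

ds = ray.data.read_parquet("s3://sensor-data-lake/drives/")  # hypothetical path


def decode_and_transform(batch: dict) -> dict:
    n = len(batch["raw_sample"])  # hypothetical column of raw sensor samples
    # Stand-in for CPU-intensive decoding/projection of 30-100 MB raw samples.
    return {"tensor": np.zeros((n, 4, 4), dtype=np.float32)}


# Each task reserves 4 CPUs; Ray Data fans this out across hundreds of cores.
ds = ds.map_batches(decode_and_transform, batch_size=32, num_cpus=4)

# Stream decoded samples straight into the consumer; nothing is materialized
# up front, and a few batches are prefetched ahead of the GPU.
for batch in ds.iter_torch_batches(batch_size=16, prefetch_batches=4):
    pass  # training or batch-inference step
```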
Building RayLab: Autodesk's Journey to Scalable Deep Learning Infrastructure
In this presentation, we describe Autodesk's journey to enabling large-scale deep learning across the company. We began by exploring managed solutions like AWS Batch and SageMaker, but quickly ran into challenges around scalability, customization, networking, and developer experience. To overcome these limitations, we turned to Ray and KubeRay, which offered the flexibility and control we needed. Building on top of these technologies, we developed RayLab - Autodesk's internal platform for scalable training, data processing, and model serving.
RayLab is a Kubernetes-native platform that abstracts away infrastructure complexity while supporting secure, efficient, and user-friendly workflows. We built wrappers for Ray cluster management via a CLI, Web UI, and Python SDK to simplify usage, reduce onboarding friction, and ensure compliance with Autodesk's internal security and networking requirements. We'll describe the architecture behind RayLab, which includes Kubernetes, KubeRay, Karpenter, Grafana, and JupyterHub - all secured with role-based access control and designed for multi-tenancy to support ML workspaces across teams.
RayLab provides high-level APIs built on Ray and PyTorch Lightning, allowing users to launch distributed training jobs with minimal code. It includes standardized checkpointing and experiment tracking to support reproducibility and consistent workflows.
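To illustrate the kind of thin wrapper such high-level APIs imply (the helper name and signature below are invented for illustration, not RayLab's real interface), here is a sketch built on Ray Train's `TorchTrainer`:

```python
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def launch_training(train_loop, num_workers: int, use_gpu: bool = True,
                    experiment_name: str = "raylab-demo"):
    """Hypothetical RayLab-style helper: a thin layer over Ray Train."""
    trainer = TorchTrainer(
        train_loop,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        # RunConfig is where standardized checkpointing/experiment names plug in.
        run_config=RunConfig(name=experiment_name),
    )
    return trainer.fit()


def train_loop(config=None):
    # The real loop would build a PyTorch Lightning module and fit it here.
    pass


# result = launch_training(train_loop, num_workers=8)
```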
We'll also share the challenges we faced in improving the efficiency of our reserved H100 GPU resources, particularly around fair sharing across teams and users. To address this, we implemented quota management and a priority-based scheduling system that enables high-priority jobs to preempt lower-priority ones, significantly increasing utilization. Additionally, RayLab supports automatic downscaling of underutilized clusters to conserve compute.
Finally, we'll conclude with a live demo of RayLab in action.
Scaling LLM Post-Training at Character.ai
Character AI is the world's leading application for AI entertainment, serving tens of millions of users per day with large language models (LLMs). To continuously improve the models that power our AI Characters, we have built a robust and scalable post-training stack entirely on open-source technologies in the Ray ecosystem. This stack, internally named Rayman, has allowed us to accelerate our model development velocity and rapidly iterate on the latest RL techniques. In this talk, we will detail the architecture of Rayman, the open-source projects we leverage and the ML and systems challenges we've overcome.
Specifically, we will cover:
1. Infrastructure for Supervised Finetuning (SFT) and Distillation: We will introduce Rayman, our internal framework built on Ray Data, Ray Train, and DeepSpeed/PyTorch FSDP for orchestrating all distributed workloads. We'll detail how we use this for large-scale SFT, including our strategy for training massive Mixture-of-Experts (MoE) models like those from DeepSeek. We will also go over our approach for synthetic data generation and knowledge distillation of state-of-the-art open-source LLMs into smaller, more tractable models.
2. Reinforcement Learning from Real User Feedback: A core challenge in aligning models for open-ended creative dialogue is that there are no verifiable rewards. We will discuss how we tackle this problem by training our own reward models on real user interaction data, which we then use for RL. We'll detail our infrastructure for this process built on Ray, which allows us to translate noisy, real-world user feedback into a clear signal that can be effectively "hill-climbed" using a variety of reinforcement learning techniques to significantly improve the quality of our models.
A foundation model for each enterprise
Enterprises possess vast and diverse datasets, yet their predictive modeling capabilities are often fragmented into siloed, single-task systems. This approach creates redundant feature engineering, incurs excessive training costs, and lacks the flexibility to efficiently answer new business questions. We argue, based on our experience at Dream1, that the paradigm of large-scale, pre-trained foundation models can be extended beyond the public domain to create a single, cohesive intelligence engine for any enterprise. This talk introduces Lumos, a framework for building enterprise-specific foundation models, and details how Ray was instrumental in surmounting the significant scaling challenges involved.
The core of Lumos is a multi-task, multi-timestep Transformer architecture designed to forecast a wide array of user behaviors (e.g., churn, engagement, lifetime value) by ingesting historical user transactions, user attributes, and the calendar of future business events (supply). Our architecture introduces a strong inductive bias through a cross-attention mechanism, where a decoder conditioned on future supply events attends to an encoder that has processed the full history of user behavior. We will also share our work on formulating the equivalents of scaling laws and emergent abilities in the realm of enterprise transaction data.
Optimizing Video AI at Scale: Cost-Effective ML Operations with Geotab and Anyscale Ray
Processing and deriving intelligence from billions of frames of video data captured by Geotab cameras can be a resource-intensive task. This presentation will share Geotab's journey of building a cost-efficient and highly automated Smart Video Platform utilizing Anyscale Ray. We will showcase how Ray serves as the backbone for hosting and orchestrating our machine learning models for video analysis, enabling both efficient real-time inference and batch processing. A key focus will be on our automated training and validation workflows, which leverage Ray's distributed capabilities to dramatically reduce the time and cost associated with model development and deployment. Learn how Geotab is achieving significant operational savings and accelerating innovation in video analytics through a strategic embrace of Anyscale Ray.
Training Perception Models at Scale
At Latitude, Ray powers dataset generation, inference, and complex training pipelines for multi-modal perception models. This talk walks through how we use Ray's distributed computing capabilities across our entire ML lifecycle, with a focus on our most demanding use case: the training pipeline.
Our training pipeline optimizes models using images, point clouds, rich metadata, and multi-task targets. Loading data fast enough to fully utilize GPUs was non-trivial. Starting with a PyTorch DataLoader, we faced constraints that limited concurrency and speed. Migrating to Ray Data unlocked greater parallelism but exposed new bottlenecks in data loading, memory management, serialization, and prefetching. This talk will share the optimizations and patterns that let us overcome these challenges and boost training throughput by over 3x. We’ll also talk about some of the hurdles we faced, including implementing sampling and improving observability into the Ray Data pipeline.
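For context on the migration described above (a generic sketch, not Latitude's pipeline), this is roughly how a Ray Data dataset replaces a PyTorch DataLoader inside a Ray Train worker, with prefetching and local shuffling handled by the iterator; the path, batch sizes, and buffer size are hypothetical.

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    shard = ray.train.get_dataset_shard("train")  # this worker's slice of the dataset
    for _ in range(config["epochs"]):
        for batch in shard.iter_torch_batches(
            batch_size=8,
            prefetch_batches=4,             # keep the GPU fed
            local_shuffle_buffer_size=256,  # lightweight sampling/shuffling
        ):
            pass  # multi-modal forward/backward pass


ds = ray.data.read_parquet("s3://perception-samples/")  # hypothetical path
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 1},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": ds},
)
# trainer.fit()
```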
This session is ideal for practitioners working on large-scale training pipelines with multi-modal data.
Scaling LinkedIn's Online Training Solution with Ray
We are the ML Training Platform team at LinkedIn. We have adopted Ray to scale up the online training solution for our real-time recommendation system behind the "Jobs you might be interested in" ranking model.
## What is Online Training?
An online training system is, in layman's terms, an ML-based “feedback loop” for a recommendation or real-time serving system that must adapt quickly to user behavior and worldview.
## Why Online Training?
Model Freshness: Most deep learning models in LinkedIn’s recommender system were static models trained offline. It takes days, weeks, or even quarters to reflect updated user focus and trending items, causing us to miss potential marketing opportunities and user retention (e.g., LinkedIn Ads, Job Recommendations).
Feature Consistency: Offline feature engineering can also cause out-of-sync issues during offline relational joins. The offline dataset used for the feature join was ETLed either before the user activity (making features outdated) or after the user activity was logged (causing label leakage). Both issues make the model inaccurate and negatively affect the recommendation results.
## What is Business Impact?
Better Recommendations: A tight feedback loop improves recommendation quality. A better click-through rate (CTR) creates more market opportunities, which in turn translates to more sales opportunities and, ultimately, higher revenue (as proven at companies such as Meta, Google, TikTok, Twitter, and Grubhub).
Lower Computation Cost: Incrementally training on top of the previous model avoids the cost of training a model from scratch. Our incremental training results show an 8.9x computation cost saving.
## How did we build online training system?
Our system consists of the following key components:
1. Streaming training data generation: Training data generation is a two-part process: Nearline Feature Attribution Generation, which produces fully attributed labels without any sampling, ensuring high-quality, detailed training labels; and a Data Transformation Service, which takes the output from the nearline attribution stage, converts it into a format compatible with our trainer component, and performs downsampling to optimize the training dataset size.
2. Ray-based scalable Kafka training data ingestion: The scalable Kafka data ingestion system achieves high throughput, handling over 10K records per second through performance optimizations such as Ray-based message prefetching and fast Pythonic data loading (see the sketch after this list).
3. Fast model post-processing: We replaced the model post-processing flow (calibration, serving config rewriting, etc.) with a direct, lightweight Rest.li-based model publishing solution that avoids the duplicate model rewriting step in expensive offline jobs, reducing end-to-end time from trained model to model publishing and serving from ~4 hours to ~15 minutes.
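As a rough illustration of the prefetching idea in item 2, the sketch below uses Ray actors to pull Kafka batches ahead of the trainer. The topic name, broker address, batch sizes, and the train_step hook are hypothetical, and this is our own simplification rather than LinkedIn's implementation.

```python
# Hypothetical sketch: Ray actors prefetch Kafka batches ahead of the trainer.
import ray
from kafka import KafkaConsumer  # assumes the kafka-python package

@ray.remote
class KafkaPrefetcher:
    """Polls Kafka and hands raw record payloads to the training loop."""

    def __init__(self, topic: str, brokers: str, group_id: str):
        self.consumer = KafkaConsumer(
            topic, bootstrap_servers=brokers, group_id=group_id
        )

    def next_batch(self, max_records: int = 1024) -> list:
        # poll() returns {TopicPartition: [records]}; flatten to raw payloads.
        polled = self.consumer.poll(timeout_ms=1000, max_records=max_records)
        return [record.value for records in polled.values() for record in records]

ray.init()
prefetchers = [
    KafkaPrefetcher.remote("training-events", "kafka:9092", "online-trainer")
    for _ in range(4)  # actor count tuned to partition count and throughput
]

# Keep one in-flight request per actor so the trainer rarely waits on Kafka.
in_flight = {p.next_batch.remote(): p for p in prefetchers}
for _ in range(100):  # bounded loop for the sketch; production code runs continuously
    ready, _ = ray.wait(list(in_flight), num_returns=1)
    actor = in_flight.pop(ready[0])
    batch = ray.get(ready[0])
    # train_step(batch)                           # hypothetical trainer hook
    in_flight[actor.next_batch.remote()] = actor  # immediately refill that actor
```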
Powering Autonomous Vehicles: Motional's Blueprint for Scalable, Performant, and Reliable ML Systems with Ray
At Motional, processing terabyte-scale data from our autonomous vehicle fleet for ML model development presents significant challenges in scalability, performance, and cost. Our legacy systems, involving manual processes, monolithic architectures like SQLite, and expensive Spark jobs, created bottlenecks that delayed feature engineering and model iteration by weeks. To overcome these hurdles, we engineered a unified, horizontally-scalable ML system design built on Ray, fundamentally transforming our ML development lifecycle.
This talk presents our blueprint for building reliable and performant data processing pipelines for large-scale detection, caching, and prediction feature generation. We will detail our migration from a cumbersome SQLite-based caching job to a fully autoscaled Ray implementation, which reduced data preparation time from over two weeks to a single day. Furthermore, we will showcase how we replaced a costly Spark/EMR pipeline with a Ray-native solution, achieving over 100x cost savings in feature caching.
We will share deep insights into our use of the Ray Actor pattern, highlighting a novel "1-actor-per-node" design. This pattern is instrumental in executing the principle of bringing compute and data together, thereby eliminating costly network communication overhead and enabling on-demand data handling. We believe this pattern is a valuable contribution to the community and a strong candidate for inclusion in the official Ray Patterns documentation. Attendees will leave with practical strategies for designing and implementing scalable, performant, and reliable ML systems to tackle complex, large-scale data challenges in any industry.
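For readers unfamiliar with the pattern, here is one hedged way to express a "1-actor-per-node" design with Ray's node-affinity scheduling; the worker class and its process_local_shard method are placeholders of ours, not Motional's implementation.

```python
# Hypothetical sketch of a 1-actor-per-node placement using node affinity.
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

@ray.remote
class NodeWorker:
    def __init__(self, node_id: str):
        self.node_id = node_id

    def process_local_shard(self, shard_uri: str) -> int:
        # Process the data assigned to this node and return a small summary,
        # instead of shipping raw bytes across the network.
        return 0  # placeholder result

ray.init()
workers = [
    NodeWorker.options(
        scheduling_strategy=NodeAffinitySchedulingStrategy(
            node_id=node["NodeID"], soft=False  # hard-pin exactly one actor per node
        )
    ).remote(node["NodeID"])
    for node in ray.nodes()
    if node["Alive"]
]
results = ray.get(
    [w.process_local_shard.remote("s3://bucket/shard") for w in workers]
)
```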
Accelerating AI at Scale - Motive's Journey with Anyscale and Ray
Motive, a leader in fleet management and IoT technology, powers millions of vehicles worldwide with AI-driven safety solutions through our cutting-edge AI DashCam technology. Facing exponential growth in data volume from dashcam video streams, telematics data, and real-time event detection, Motive confronted significant scalability and productivity challenges in AI model training and testing workflows.
In this talk, we share our transformative journey leveraging Anyscale and Ray to drastically accelerate our AI productivity and innovation velocity. Specifically, by migrating our production model training workloads onto managed Ray clusters via Anyscale, we achieved a 10x reduction in model training time. Similarly, our Testinfra back-testing workflows, crucial for validating model updates and ensuring reliable deployment, experienced a remarkable 10x improvement in processing speed.
We detail the architectural decisions behind our overall AI infrastructure, including:
● Efficient GPU resource allocation strategies using Ray's distributed computing capabilities (see the sketch after this abstract).
● Clear delineation of Proof-of-Concept (PoC) and Production clusters for improved governance and cost control.
● Modular restructuring of our batch inference and testing workflows within Ray, enhancing maintainability and scalability.
Attendees will gain insights into practical strategies for successfully implementing Ray at scale, learn best practices for infrastructure and workflow optimization, and understand how Anyscale and Ray can significantly accelerate AI productivity in production environments.
Join us to explore how Motive harnessed the power of Ray and Anyscale to drive innovation, improve safety outcomes at scale, and transform productivity for millions of connected vehicles globally.
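As a small, hedged illustration of GPU-aware scheduling in Ray (our own example, not Motive's infrastructure), tasks and actors can declare whole or fractional GPU requirements and Ray packs them onto the available accelerators:

```python
# Hypothetical sketch of declaring GPU needs per task/actor (assumes a GPU node).
import ray

ray.init()

@ray.remote(num_gpus=1)            # a training step that needs a whole GPU
def train_step(shard_id: int) -> float:
    return 0.0                     # placeholder loss

@ray.remote(num_gpus=0.25)         # four lightweight scorers share one GPU
class BatchScorer:
    def score(self, frames: list) -> list:
        return []                  # placeholder predictions

losses = [train_step.remote(i) for i in range(8)]  # queued against free GPUs
scorers = [BatchScorer.remote() for _ in range(4)]
```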
Mako: Netflix's Next Generation ML Training Platform
At Netflix, we are building Mako, a new ML training platform designed to meet the demands of modern AI workloads. In this talk, we will share how we evolved our training platform, improved GPU efficiency using a custom scheduler, and made key architecture changes to support large-scale training. We will also cover how Ray fits into this journey and what we learned along the way.
Ray @ Robinhood: Building a Seamless Distributed ML Training Platform on Kubernetes with KubeRay
As Robinhood’s machine learning needs have evolved to require training larger models on larger datasets, we began to reach the limits of what could be achieved with single-node training. As a result, distributed training became an essential capability to support on our ML platform. In this session, we’ll share how Robinhood evaluated and adopted KubeRay to support large-scale distributed training workloads. We’ll dive into the architectural decisions, the integration with our existing ML training stack, and the platform-level abstractions we built to make distributed training seamless and accessible for our internal ML teams. We’ll also discuss the aspects of Robinhood’s Kubernetes infrastructure that led us to adopt certain parts of KubeRay and where we decided to substitute components based on our unique needs. This session will offer practical insights and lessons learned on integrating Ray into your ML platform.
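To make the shape of such a platform concrete, here is a minimal Ray Train sketch of the kind of distributed job a KubeRay-backed platform might run; the toy model, data, and worker count are assumptions of ours, not Robinhood's stack.

```python
# Hypothetical sketch of a distributed PyTorch job with Ray Train.
import torch
import torch.nn as nn
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

def train_loop_per_worker(config: dict):
    model = prepare_model(nn.Linear(16, 1))  # wraps in DDP, moves to the device
    optim = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for _ in range(config["epochs"]):
        x = torch.randn(32, 16)              # placeholder data
        loss = model(x).pow(2).mean()
        optim.zero_grad()
        loss.backward()
        optim.step()
        train.report({"loss": float(loss)})

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 2},
    # On a KubeRay cluster, these workers map onto Ray worker pods.
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
)
result = trainer.fit()
```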
How Roblox Leverages Ray Infrastructure to Train 3D Foundation Models
In this talk, we present how Roblox built an ML platform on Ray and used it to train our 3D foundation model. We will talk about integrating KubeRay with Istio and Kubeflow for authentication and multi-tenancy, open-sourcing the KubeRay dashboard with support for iterative development, accelerating Docker images with P2P distribution and lazy pulling, and scaling Ray jobs to multiple clusters. We’ll then talk about the challenges of applying Ray to training foundation models, including running batch LLM labeling jobs and using Ray Data at scale. At Roblox, distributed training jobs were previously launched through MPI. While MPI can serve the core function of launching training jobs on multiple GPUs, it lacks essential functionality, including observability and fault tolerance. Over the past year, as a platform team, we explored adopting Ray Train as our default distributed training framework and successfully converted the majority of distributed training use cases to Ray.
RayComfyUI: Scalable, Distributed ComfyUI Powered by Ray
ComfyUI is a leading AI workflow builder in the generative AI community, particularly well-known for its visual and vision-focused applications. However, its original architecture is limited to single-machine, single-user scenarios—making it difficult to scale for larger models or broader user bases.
RayComfyUI addresses these limitations by leveraging the power of Ray to enable distributed execution and scalable serving of ComfyUI workflows.
Two Sigma's Road to Large-Scale Model Training through Ray
Machine learning (ML) researchers at Two Sigma are confronted with the challenge of efficiently training a variety of ML models on large-scale financial data which can be up to multiple TBs in size. This also includes Large Language Model (LLM) finetuning and inference. In this presentation, we explain how we were able to introduce and integrate Ray into the ML research environment here at Two Sigma by overcoming multiple challenges including: integrating Ray with internal frameworks such as our cloud compute scheduler, network and security considerations, user experience around cluster configuration and management, and productionization of research pipelines built on Ray. Ultimately, we were able to create a new distributed compute engine for ML powered by Ray which provides a seamless user experience for efficient training of ML models on large-scale financial data.
Revolutionizing Model Serving with a 50x Cost Reduction using Ray Serve at Workday
Workday uses a tenanted, regionalized architecture in order to ensure data isolation and in-region execution, both of which are crucial requirements for our customers. In early 2023, facing challenges with the ever-increasing scale and cost required to serve dedicated ML models for every tenant in every environment, we decided to completely redo how we serve models using a hot new technology: Ray! We now use Ray Serve to serve tens of thousands of ML models across more than a dozen environments. Ray Serve’s inherent capabilities of per-deployment autoscaling and efficient request routing have enabled 50x cost reductions compared to our previous systems while maintaining high availability and low latency.
But it wasn’t all smooth sailing: Ray Serve was not originally designed for some of the odd ways our system works, and we ran into scaling problems with only a few dozen applications deployed. Thankfully, Ray is open-source, and we were able to dramatically improve the scalability of Ray Serve through a series of contributions we made back to the community. Today, Ray Serve can handle thousands of applications per cluster.
In this talk, we’ll discuss how we use Ray Serve to implement model serving at Workday. We’ll dive deep into our off-the-beaten-path usage of Ray Serve, the challenges we faced with scaling up our system, and the improvements we’ve contributed back to Ray to solve them. You’ll come away from this talk with a deeper understanding of how Ray Serve works and how to build complex systems on top of it – and maybe even contribute back to it!
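As a hedged sketch of the multi-application shape described here (not Workday's actual service), Ray Serve lets each tenant's model run as its own application with per-deployment autoscaling; the tenant names, model paths, and scaling bounds are illustrative.

```python
# Hypothetical sketch: one Ray Serve application per tenant with autoscaling.
from ray import serve

@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
    ray_actor_options={"num_cpus": 1},
)
class TenantModel:
    def __init__(self, model_uri: str):
        self.model_uri = model_uri  # a real deployment would load the artifact here

    async def __call__(self, request):
        payload = await request.json()
        features = payload.get("features", [])
        return {"model": self.model_uri, "prediction": sum(features)}  # toy scorer

# Deploy many independent applications onto one Serve cluster.
for tenant in ["acme", "globex"]:
    serve.run(
        TenantModel.bind(f"s3://models/{tenant}/latest"),
        name=f"{tenant}-app",
        route_prefix=f"/{tenant}",
    )
```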
Marin: Open Development of Open Foundation Models
Open-source software thrives because its entire lifecycle—code, tests, debates, even missteps—remains publicly writable. Foundation models rarely meet that standard: most “open-weight” releases omit the training code, data recipe, and experiment logs that make results reproducible.
Marin closes that gap. Every run begins life as a GitHub issue and pull request that states the hypothesis, pins the config, and invites community review. Once merged, Ray orchestrates the job across preemptible Google Cloud TPUs, streams metrics, and deposits artifacts that map 1-to-1 with the commit that launched them. Failures, restarts, and course corrections are kept in the open, rather than relegated to an appendix or hidden behind closed doors.
Using this workflow we trained Marin-8B, an 8-billion-parameter model that outperforms Llama 3.1 8B Base on 14/19 benchmarks. By conference time we will have scaled to 32B parameters and stood up a highly scalable Ray-based RL pipeline for training agents.
The session will dissect the infrastructure—autoscaling, fault tolerance, dataset pipelines—and share lessons from hundreds of thousands of public TPU hours. We’ll also discuss ways to join our community: contribute a new optimizer in our Speedrun challenge, curate domain data through our Datashop, add a new environment for agentic RL, or just follow the logs in real time.
SkyRL: A scalable and flexible post-training framework
SkyRL is a modular, performant RL framework for training language models with a focus on agentic tasks. SkyRL is built for flexible modification and extension to enable rapid prototyping and experimentation without compromising performance. In this talk, we discuss SkyRL's architecture and its motivation, as well as the infrastructure challenges in scaling RL for agentic tasks.
In this talk, we present the motivation behind SkyRL and outline its core architectural principles, including modular policy training, customizable reward pipelines, and efficient distributed execution. We then discuss the infrastructure challenges inherent in scaling RL for agentic tasks, such as handling high-throughput environment interactions, optimizing distributed rollouts, and managing heterogeneous compute demands. Finally, we share insights from building SkyRL to highlight how principled framework design can accelerate research and unlock new possibilities in training agentic language models.
AI to help fight climate change
Accelerating AI Pipelines with AnalyticDB Ray: Alibaba Cloud's Approach to Data-AI Convergence
In the era of data-driven innovation, efficiently processing and analyzing multi-modal data is crucial for building effective AI pipelines. This presentation will focus on real-world applications of Alibaba Cloud's AnalyticDB Ray, showcasing how its capabilities are leveraged within a data warehouse environment to accelerate AI initiatives.
We will delve into practical use cases that demonstrate the seamless integration of multi-modal ETL and machine learning:
1. Optimizing Advertising Recommendation Inference: Learn how AnalyticDB Ray is used for offline batch inference in advertising recommendations, specifically for estimating click-through rates (CTR). This includes details on how heterogeneous resources (CPU and GPU) are independently and automatically scaled to maximize GPU utilization, achieving an increase from less than 5% to 40%. We will also discuss the dynamic auto-scaling of object storage based on data volume, which has improved data processing performance by 2 to 3 times.
2. Accelerating Large Language Model (LLM) Offline Batch Inference and Data Distillation: Discover how AnalyticDB Ray facilitates large-model data preparation. We will illustrate the use of Ray Data and vLLM/SGLang for data distillation with models like Qwen and DeepSeek, which then fuels large-model training (a minimal sketch follows this abstract). Key benefits include a 2-3x improvement in data loading throughput due to caching, scheduling of 40,000 fine-grained tasks within a single Ray cluster, and a 50% performance increase for DeepSeek INT8 quantization compared to FP8 in offline distillation scenarios.
3. Efficient Distributed Fine-tuning of Multi-modal Models: Explore how AnalyticDB Ray, integrated with Lance, enhances distributed image-text data processing and structuring using Ray Data for multi-modal personalized interactive scenarios. We will also showcase the integration with LLaMA-Factory to provide distributed fine-tuning capabilities for Qwen-VL multi-modal models. This offers a one-stop solution from data labeling to model fine-tuning and has improved distributed fine-tuning efficiency by 3-5 times.
These examples will illustrate how AnalyticDB Ray unlocks the potential for in-warehouse AI pipelines, seamlessly integrating multi-modal ETL and machine learning to accelerate the journey from data to intelligent decision-making.
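The sketch below illustrates the batch-distillation shape from item 2 using Ray Data with a vLLM engine per actor. It is our own simplified example, not AnalyticDB Ray's implementation; the model name, bucket paths, and batch settings are placeholders.

```python
# Hypothetical sketch of offline batch inference with Ray Data + vLLM.
import numpy as np
import ray
from vllm import LLM, SamplingParams

class Distiller:
    def __init__(self):
        # One vLLM engine per actor; Ray assigns each actor a GPU below.
        self.llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
        self.params = SamplingParams(max_tokens=256, temperature=0.7)

    def __call__(self, batch: dict) -> dict:
        outputs = self.llm.generate(list(batch["prompt"]), self.params)
        batch["response"] = np.array([o.outputs[0].text for o in outputs])
        return batch

ds = ray.data.read_parquet("s3://bucket/prompts/")  # expects a "prompt" column
results = ds.map_batches(
    Distiller,
    concurrency=4,  # four long-lived actors, each holding one engine
    num_gpus=1,     # one GPU per actor
    batch_size=64,
)
results.write_parquet("s3://bucket/distilled/")
```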
Exabyte-scale Streaming Iceberg IO with Ray, Flink, and DeltaCAT
This production case study highlights how Amazon uses Ray and DeltaCAT at exabyte scale to resolve longstanding performance and scale challenges in integrating streaming pipelines with Apache Iceberg. It also shows how the Apache Flink, Ray, Apache Beam, and Apache Spark communities can start bringing the same benefits to their workloads by using DeltaCAT's Iceberg table management jobs on Ray together with Flink and Beam.
verl: A Flexible and Efficient RL Framework for LLM Reasoning and Tool-calling
Large language models (LLMs) have enabled advances in language understanding and multimodal tasks, yet scaling reinforcement learning (RL) with these models remains challenging: most existing frameworks either lack the abstractions needed to define and manage complex dataflows or cannot handle models with billions of parameters.
verl (https://github.com/volcengine/verl) is an open-source framework for building end-to-end RL pipelines with LLMs. It provides high-level abstractions and optimizations for dataflow orchestration and resource management via a Ray-based hybrid-controller model: the entire RL dataflow runs as a single controller process on the Ray driver, issuing primitive API calls to WorkerGroup modules. The WorkerGroup and ResourcePool components distribute computation and resources across GPU clusters, delivering high throughput and strong extensibility.
Since its release, verl has seen adoption in both academic research and industry production. It integrates with major training backends (FSDP, FSDP2, Megatron-LM) and inference engines (vLLM, SGLang), and supports various RL algorithms (PPO, GRPO, DAPO, etc.) with effortless scaling. Recent trends in reasoning models bring new challenges to RL infrastructure, such as efficient tool calling, multi-turn interactions, and the ability to scale up to giant MoE models like DeepSeek 671B. To lower the barrier to RL for advanced reasoning and tool calling, we recently improved verl with (1) efficient request-level async multi-turn rollout and tool calling, (2) integration with expert parallelism for large-scale MoE models, and (3) an async system architecture for off-policy / async RL algorithms and flexible device placement.
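To illustrate the single-controller / worker-group idea in general Ray terms (a simplified sketch of ours, not verl's actual classes), the driver can hold the whole RL dataflow and fan primitive calls out to pools of actors:

```python
# Hypothetical sketch of a driver-side controller issuing calls to a worker group.
import ray

@ray.remote  # in practice each worker would also request num_gpus=...
class RolloutWorker:
    def generate(self, prompts: list) -> list:
        return [p + " -> sampled completion" for p in prompts]  # placeholder rollout

    def update_policy(self, version: int) -> None:
        self.version = version  # a real worker would reload policy weights

class WorkerGroup:
    """Driver-side handle that fans primitive calls out to a pool of actors."""

    def __init__(self, size: int):
        self.workers = [RolloutWorker.remote() for _ in range(size)]

    def generate(self, prompts: list) -> list:
        shards = [prompts[i::len(self.workers)] for i in range(len(self.workers))]
        outs = ray.get([w.generate.remote(s) for w, s in zip(self.workers, shards)])
        return [item for shard in outs for item in shard]

    def update_policy(self, version: int) -> None:
        ray.get([w.update_policy.remote(version) for w in self.workers])

# The entire RL dataflow lives in one controller process on the Ray driver.
ray.init()
rollout = WorkerGroup(size=4)
for step in range(3):
    samples = rollout.generate([f"prompt-{i}" for i in range(8)])
    # ...score rewards, run the training worker group, sync weights...
    rollout.update_policy(version=step + 1)
```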
AIBrix and DeerFlow in Action: Scalable LLM Inference for Agentic Workloads
As large language models (LLMs) continue to drive next-generation AI and agentic workloads, infrastructure teams face growing challenges to deliver performance, scalability, and cost-efficiency at production scale. In this talk, we introduce AIBrix—an open-source, Kubernetes- and Ray-powered control plane purpose-built to streamline and optimize LLM inference.
Co-developed with the vLLM community, AIBrix delivers:
- LLM-Specific Autoscaling – Responsive, workload-aware scaling for improved resource efficiency
- Smart KVCache Management – Multi-level caching with offloading and prefix-aware reuse to reduce memory usage and latency
- Load- and Cache-Aware Routing – Adaptive routing strategies that ensure fair, low-latency traffic distribution under real-world load
We’ll also highlight recent innovations, including dynamic LoRA orchestration and cost-effective heterogeneous hardware support. Finally, we’ll showcase how AIBrix powers agentic workloads through DeerFlow, an open-source deep research framework. You’ll see real-world use cases, such as building a personal research assistant on top of open-source LLMs, and learn how AIBrix enables reliable, scalable, and low-latency agent execution.
Join us to explore AIBrix’s architecture, performance breakthroughs, and how it’s shaping the future of enterprise-grade LLM infrastructure.
The Decoupled Agent: A Vision for a Native Agentic Runtime on Ray and Kubernetes
Agentic AI systems—composed of LLMs, tools, and stateful agents—present a fundamental distributed computing challenge. We argue that Ray's architecture, built on a foundation of stateless tasks and stateful actors, provides a natural and powerful execution runtime for these systems. This is not an accident of implementation, but a core design principle: Ray’s architecture treats stateful and stateless computation fundamentally differently. Ultimately, it means you don't have to build a separate, complex service discovery and state management system for your agent; Ray provides it natively.
This talk presents a vision for this paradigm, demonstrated with a reference architecture on Kubernetes. Kubernetes provides the declarative foundation for the entire system, managing container lifecycles, service discovery, and network policies, while the KubeRay operator acts as the bridge to deploy and manage Ray clusters natively. We show how to build a robust system where the agent, built using the Agent Development Kit (ADK) to structure its logic and tool-use capabilities, is deployed as a Ray Actor. Actor lifecycle and location are managed centrally by Ray’s GCS, providing the durability and discoverability needed for a reliable agent. In contrast, the agent's tools are executed as stateless Ray Tasks. Task submission follows a decentralized ownership model, enabling extremely high-throughput, low-latency execution ideal for function calling. This decoupled architecture, built entirely on open-source Ray, represents the foundational pattern for portable, vendor-neutral agentic systems.
The interaction between Ray and Kubernetes becomes particularly powerful when the KubeRay operator makes Ray topology-aware, inheriting node labels to allow for fine-grained placement of actors for high availability or low-latency. By leveraging this deep architectural alignment, we can build a truly resilient and scalable agentic runtime.
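A minimal sketch of this decoupled pattern is shown below: the agent is a named, stateful Ray actor while its tools run as stateless Ray tasks. The agent logic, tool functions, and routing rule are placeholders of ours and do not reflect the ADK API.

```python
# Hypothetical sketch: stateful agent as a Ray actor, tools as stateless tasks.
import ray

@ray.remote
def web_search(query: str) -> str:
    return f"search results for: {query}"  # placeholder tool

@ray.remote
def calculator(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}))  # toy sandbox for the sketch

@ray.remote
class Agent:
    """Conversation state lives in the actor; the GCS tracks its lifecycle."""

    def __init__(self):
        self.history = []

    def handle(self, user_msg: str) -> str:
        self.history.append(("user", user_msg))
        # A real agent would let the LLM choose the tool; here we branch naively.
        if any(ch.isdigit() for ch in user_msg):
            tool_name, ref = "calculator", calculator.remote(user_msg)
        else:
            tool_name, ref = "web_search", web_search.remote(user_msg)
        reply = f"(via {tool_name}) {ray.get(ref)}"
        self.history.append(("agent", reply))
        return reply

ray.init()
# A named, detached actor is discoverable by other processes through the GCS.
agent = Agent.options(name="support-agent", lifetime="detached").remote()
print(ray.get(agent.handle.remote("2 + 2")))
```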
Lance ♥️ Ray: Accelerating Multimodal Feature Engineering
LanceDB is building the Multimodal Lakehouse—a foundation for multimodal AI applications that unify unstructured and structured data. At its core, Ray powers our feature engineering engine, enabling AI engineers using LanceDB to focus on highly creative work instead of data infrastructure.
In this talk, we share how LanceDB uses Ray to deliver reliable and efficient multimodal data processing. We'll walk through our design for incremental and filtered backfills, fine-grained checkpointing, and robust preemption handling, all built on Ray's flexible execution primitives. We'll also discuss how we deploy across local dev setups, custom Ray clusters, and cloud deployments, making our platform adaptable from laptops to cloud clusters.
Whether you're building your own AI data platform or wrestling with feature pipelines for unstructured data, this talk will show how Ray can be a backbone for scalable, resilient, and intelligent data transformation.
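As a rough illustration of the backfill pattern described above (not LanceDB's actual API), here is a minimal Ray Data sketch: a filtered, incremental feature backfill whose partitioned writes approximate fine-grained checkpointing. The embedding function, column names, and output path are all hypothetical.

```python
import ray
import ray.data

def embed_batch(batch):
    # Hypothetical embedder: a real pipeline would call a vision or text model.
    batch["embedding"] = [[float(len(c))] for c in batch["caption"]]
    return batch

ray.init()

rows = [{"id": i, "caption": f"image {i}", "needs_embedding": i % 2 == 0}
        for i in range(1_000)]

ds = (
    ray.data.from_items(rows)
    .filter(lambda r: r["needs_embedding"])               # filtered, incremental backfill
    .map_batches(embed_batch, batch_format="pandas", batch_size=128)
)

# Many small partitioned files approximate fine-grained checkpointing: after
# a preemption, only partitions that never landed need to be recomputed.
ds.write_parquet("/tmp/feature_backfill")
```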
Matrix: reliable framework for data-centric experimentation at scale
Scaled, high-quality data is the oil driving progress toward AGI in research and development. Thanks to foundational work such as Ray, Slurm, and vLLM, it has become much easier to manage compute resources at scale and to access a diverse set of SOTA LLMs. However, these tools are often designed for experienced engineers, creating entry barriers that keep researchers from unleashing their full potential. Thus, in the Fundamental AI Research (FAIR) lab at Meta, we have built Matrix, a reliable framework for data-centric experimentation at scale, to connect these foundational pieces so researchers can quickly iterate on their ideas and run experiments with large-scale models and data.
Matrix supports robust, auto-scaled data generation from LLMs, game engines, and physics or world-model simulators with one command. It also offers easy setup for scalable data processing and augmentation, such as batch LLM-as-a-judge, safe code execution for verification, and data dedup, classification, or clustering, along with efficient and reproducible evaluation pipelines for large teams to collaborate on. Matrix is widely used to power Meta's research and production bets in AGI, including MLLMs and world modeling. In this session, we will introduce the Matrix framework, from its design and synergy with other industry initiatives like Ray and vLLM to the research and production use cases it enables. We will also provide a short tutorial for developers who want to get involved.
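As a hedged illustration of the kind of batch LLM-as-a-judge step a framework like Matrix automates with one command, the sketch below shards prompts across vLLM engines using plain Ray actors. The worker class, model name, and prompts are illustrative and are not Matrix's actual interface.

```python
import ray
from vllm import LLM, SamplingParams

@ray.remote(num_gpus=1)
class JudgeWorker:
    """One vLLM engine per GPU; a Matrix-style framework shards work across many of these."""
    def __init__(self, model: str):
        self.llm = LLM(model=model)
        self.params = SamplingParams(temperature=0.0, max_tokens=8)

    def judge(self, prompts):
        outputs = self.llm.generate(prompts, self.params)
        return [o.outputs[0].text.strip() for o in outputs]

ray.init()
# Model name and prompts are illustrative; any vLLM-supported judge model works.
workers = [JudgeWorker.remote("meta-llama/Llama-3.1-8B-Instruct") for _ in range(2)]

samples = [f"Answer GOOD or BAD: is response #{i} helpful?" for i in range(1_000)]
shards = [samples[i::len(workers)] for i in range(len(workers))]
verdicts = ray.get([w.judge.remote(s) for w, s in zip(workers, shards)])
```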
Breaking the Dataset Iteration Bottleneck: Real-Time ML Experimentation with Ray
At Pinterest, iterating on dataset curation and label generation consistently improves our recommendation models, but this process is severely constrained by expensive and time-consuming data generation workflows. When experimenting with new sampling strategies, features, or labels, teams face a critical choice: either backfill long-running jobs that strain compute resources and budget, or wait weeks for experimental datasets to naturally populate with new data. This creates a fundamental barrier to data-driven model improvement, where a single dataset iteration either costs thousands of dollars and requires tedious monitoring of the backfill process, or takes weeks of waiting. In either case, developer velocity is severely impacted.
Two pivotal use cases within Pinterest exemplify these challenges: dataset sampling strategy exploration and label generation for downstream engagement modeling. Sampling is fundamental to creating training datasets from massive data repositories. The sampling strategy determines the composition and quality of the resulting dataset, and thus of the resulting model, yet iterating on these strategies is prohibitively difficult and expensive. The current data generation workflows also prevent adoption of sophisticated techniques like score-based negative sampling, which requires real-time computation during training. Downstream engagement labels present a similarly complex challenge. Unlike immediate-action labels, they aim to drive long-term user engagement rather than instant responses, and each label involves multiple tunable hyperparameters (e.g., engagement decay), creating a vast search space. In both cases, teams would ideally run hyperparameter tuning to systematically explore these search spaces and identify optimal configurations, but the current data pipeline architecture makes such comprehensive exploration prohibitively expensive and time-consuming.
To address these limitations, we shifted both use cases from static dataset generation to a streaming paradigm using Ray that enables truly iterative experimentation, moving sampling and label generation logic directly into the training dataloader so data is processed in real time. This eliminates the costly choice between expensive backfills and weeks of waiting, while enabling comprehensive hyperparameter exploration. The impact spans both domains: sampling changes on ranking models now ship with 10x faster development time, while downstream engagement label experimentation has been reduced from 6 weeks to 3 days and adopted by multiple teams. The solution's power is fully realized during productization, where teams must simultaneously optimize both label generation parameters and sampling strategies; our unified approach handles both seamlessly within the same pipeline. Combined with Ray's bucket join capabilities, which enable joins over large embedding features and multi-day datasets that were previously impossible due to cost and compute constraints, this has saved hundreds of thousands of dollars while transforming dataset iteration from a fundamental bottleneck into an enabler of rapid experimentation.
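A minimal sketch of the streaming idea, assuming hypothetical column names and knobs rather than Pinterest's production pipeline: sampling and label generation run inside the Ray Data pipeline feeding the trainer, so changing a hyperparameter never requires regenerating an offline dataset.

```python
import numpy as np
import ray
import ray.data

def sample_and_label(batch):
    # Hypothetical knobs, tunable per run without regenerating any offline dataset.
    neg_sample_rate = 0.1
    engagement_decay = 0.9
    keep = (batch["label"] == 1) | (np.random.rand(len(batch)) < neg_sample_rate)
    batch = batch[keep].copy()
    batch["downstream_label"] = batch["engagement"] * engagement_decay
    return batch

# Synthetic stand-in for the raw engagement logs.
rows = [{"label": int(i % 10 == 0), "engagement": float(i % 7)} for i in range(100_000)]
ds = ray.data.from_items(rows).map_batches(sample_and_label, batch_format="pandas")

# The trainer consumes freshly sampled and labeled rows on the fly.
for torch_batch in ds.iter_torch_batches(batch_size=1024):
    pass  # feed torch_batch to the ranking model
```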
Building a Model Fitting Framework for Quant Finance with Ray & Anyscale
Quant trading and research teams at Point72/Cubist have diverse needs related to data, models, and their specific use cases. Investing in an on-premise Ray cluster enabled Ray-focused approaches, but adoption has not always been seamless. Challenges emerged around data management (loading, reuse, access), scaling (efficiently performing parallel windowed model training, sometimes on tens of terabytes of timeseries data), and platform usage (determining how and when to utilize an Anyscale cluster).
The Cubist Core Research Technologies team has developed libraries to tackle these challenges. We'll discuss patterns we found to be successful, those that fell short, and share insights that we believe can be widely applied.
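One widely applicable pattern the abstract alludes to, parallel windowed model fitting over a shared timeseries, can be sketched with plain Ray tasks. The toy linear fit, window sizes, and synthetic data below are illustrative and are not Cubist's libraries.

```python
import numpy as np
import ray

@ray.remote
def fit_window(series: np.ndarray, start: int, window: int) -> dict:
    """Fit a toy model (a linear trend) on one rolling window of the series."""
    y = series[start:start + window]
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, deg=1)
    return {"start": start, "slope": float(slope), "intercept": float(intercept)}

ray.init()
prices = np.cumsum(np.random.randn(10_000))   # synthetic stand-in for a timeseries

# Put the shared series into the object store once so every window task reads
# the same object instead of reloading the underlying data per task.
series_ref = ray.put(prices)

window, stride = 252, 21
futures = [fit_window.remote(series_ref, start, window)
           for start in range(0, len(prices) - window, stride)]
fits = ray.get(futures)
```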
Ray-Powered ML Cloud Migration at Uber: Enabling Large-Scale Model Training on GPUs/TPUs
As Uber accelerates its machine learning infrastructure migration to the cloud, Ray has played a pivotal role in enabling scalable, flexible, and efficient distributed training across heterogeneous compute environments. This talk presents our journey of leveraging Ray to streamline the transition, with a focus on training large-scale models such as LLMs and recommendation systems on high-end GPUs and TPUs.
We will highlight key architectural strategies including cross-cloud training and disaster recovery planning, which ensure robustness and continuity in production-grade ML workloads. The session also dives into our approach for selecting and optimizing GPU/TPU hardware to maximize training throughput and cost efficiency, supported by benchmarks and operational learnings.
Additionally, we will introduce our early exploration of Ray Turbo and its potential to further improve training performance at scale. Real-world use cases—such as large language models and two-tower recommendation systems—will ground the discussion, offering insights into how Uber scales deep learning in the cloud while maintaining agility and resilience.
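For readers unfamiliar with how such distributed training workloads are expressed, here is a generic Ray Train sketch, not Uber's internal setup: the same training loop scales across GPU workers by changing only the scaling configuration. The model, data, and hyperparameters are placeholders.

```python
import torch
import ray.train
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    device = ray.train.torch.get_device()
    model = ray.train.torch.prepare_model(torch.nn.Linear(128, 1))  # wraps for DDP
    opt = torch.optim.Adam(model.parameters(), lr=config["lr"])
    for _ in range(config["epochs"]):
        x = torch.randn(256, 128, device=device)   # stand-in for a real batch
        loss = model(x).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        ray.train.report({"loss": loss.item()})

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 2},
    # Changing worker count or hardware type is a config change,
    # not a change to the training loop itself.
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
result = trainer.fit()
```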
Rebuild and Accelerate: A Case Study on Deploying ML and AI Workloads on KubeRay
Rebuilding an ML and AI serving layer is a daunting task, but it has massive upside. In this talk we'll cover a case study of how we migrated 10 models to KubeRay to power our realtime and batch workloads, reducing costs by 50% while improving throughput by 10x. We'll also cover the information and technical architecture that allowed us to cut time to production from one month to two days. Finally, we'll wrap up with the tradeoffs of serving Gen AI vs. encoder-based models within an internal Kubernetes environment.
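As a hedged sketch of what a single migrated model can look like under Ray Serve on KubeRay (the model, replica counts, and scoring logic are placeholders, not the deployments described in the talk), a deployment is defined once and can then be run locally or referenced from a RayService resource.

```python
from starlette.requests import Request
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
class EncoderModel:
    def __init__(self):
        # A real replica would load model weights here, once per replica.
        self.ready = True

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"score": float(len(payload.get("text", "")))}  # placeholder scoring

app = EncoderModel.bind()
# Run locally with `serve run module:app`, or point a RayService custom
# resource at this app so the KubeRay operator manages it on Kubernetes.
```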