Through our experience building Ray, we have worked with hundreds of AI and platform teams productionizing AI.¹ As AI workloads have evolved from classical ML to deep learning to generative AI, the software stack for running and scaling these workloads has ballooned in complexity.
Over time, industries standardize on common tech stacks. In the case of infrastructure products, the standards are often open source. Just as Spark eclipsed Hadoop for big-data analytics, Kubernetes has become the standard for container orchestration and PyTorch dominates deep learning.
Now over a decade into deep learning, many teams are landing on the same recipe. This blog post describes a snapshot of that emerging stack based on our experience working with Ray users.
We’ll break down each layer, then illustrate it with case studies from Pinterest, Uber, Roblox, and five popular open source post-training frameworks.
The right stack connects a team’s AI workloads to its underlying compute resources (especially GPUs). The teams piecing together these software stacks tend to solve for a few common requirements.
Three key workloads. They need to support model training, model serving, and batch inference (especially multimodal data processing). Within training, pre-training and post-training are emerging as distinct workloads, each with its own requirements and tech stack.
Massive scale. AI is computationally intensive as well as data intensive. Platform and AI teams are working to accommodate orders of magnitude more data. Supporting scale means solving hard reliability and cost-efficiency challenges.
Iteration speed. Top AI teams track the number of experiments per developer per month. Unfortunately, performance and scale are sometimes at odds with rapid iteration. In the worst case, new ideas can be incompatible with existing optimized infrastructure.
Future-proofing. No one wants to rearchitect code to adopt the latest models, frameworks, or sources of compute capacity. Swapping in H100 spot nodes should be a parameter change, not a rewrite.
A common recipe for AI compute is Kubernetes + Ray + PyTorch + vLLM.
This stack consists of three layers, each with a distinct slice of responsibilities.
The training and inference framework: runs models fast on GPUs, handling model optimizations, parallelism strategies, model definition, and automatic differentiation.
The distributed compute engine: schedules tasks within a job, moves data, handles failures, autoscales workloads, and manages process lifecycles.
The container orchestrator: allocates nodes, schedules entire jobs, spins up containers, and manages multitenancy.
Fig 1. Kubernetes + Ray + PyTorch + vLLM: a popular software stack for AI compute.
To make this concrete, we’ll ground the layers in a real workload: automated video moderation. The workload stages are:
Video decoding and standardization (CPU)
Computer vision and audio tasks (GPU)
LLM reasoning (GPU)
GPUs and other accelerators are the centerpiece of AI compute, and so it is vitally important that models take full advantage of them.
Running example: In the video processing example, all of the computer vision tasks, the audio tasks, and other classification tasks would be executed by a deep learning framework like PyTorch. An LLM inference engine like vLLM would run LLM reasoning across multiple GPUs.
A deep learning framework is at the heart of this layer. PyTorch is the dominant framework today, with JAX as a popular alternative, especially for running on TPUs. These frameworks have consolidated after a decade-long shake-out that included Theano, Torch, Caffe, Chainer, MXNet, and TensorFlow.
Surrounding PyTorch, we have frameworks designed for model parallelism (sharding models across multiple hardware accelerators). These frameworks can be divided along two primary axes. First, these frameworks may be designed for training or for inference (or for both). Second, they may be general purpose (meaning they can be used with any model) or specialized for transformers. Transformers are sufficiently important and complex that they deserve their own specialized frameworks.
PyTorch FSDP (fully-sharded data parallel) and DeepSpeed are general frameworks for model parallelism, primarily optimized for training. Nvidia Megatron is a model-parallel training framework specialized for transformers. This specialization allows for more optimized sharding strategies (e.g., different sharding strategies for different weights).
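To make the training side of this layer concrete, here is a minimal sketch of wrapping a model with PyTorch FSDP. It assumes a multi-GPU node and a launcher such as torchrun; the model architecture, batch, and hyperparameters are illustrative placeholders, not a recommended configuration.

```python
# Minimal PyTorch FSDP sketch: shard a model's parameters across the GPUs on a node.
# Assumes a launcher such as `torchrun --nproc_per_node=8 train.py`; the model,
# batch, and optimizer settings below are illustrative placeholders.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")                      # one process per GPU
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(                         # stand-in for a real transformer
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()
    model = FSDP(model)                                  # parameters are sharded across ranks

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(10):                                  # toy training loop
        batch = torch.randn(8, 4096, device="cuda")
        loss = model(batch).pow(2).mean()
        loss.backward()                                  # gradients are reduced and resharded
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```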
On the inference side, another set of frameworks, vLLM, SGLang, and Nvidia TensorRT-LLM, are inference engines specialized for running transformers. They implement a number of important techniques for optimizing transformer inference, including continuous batching, paged attention, speculative decoding, and chunked prefill.
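As a small illustration of this layer, here is a hedged sketch of offline batch inference with vLLM. The model name, parallelism degree, and prompts are placeholders rather than recommendations.

```python
# Sketch of offline (batch) LLM inference with vLLM. The engine applies optimizations
# such as continuous batching and paged attention internally; the model name and
# tensor_parallel_size below are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
    tensor_parallel_size=2,                     # shard the model across 2 GPUs
)
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [
    "Summarize this video transcript in two sentences: ...",
    "Does the following transcript violate the content policy? ...",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```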
Running example: In the video processing example, a distributed compute engine like Ray handles the scheduling of different stages of processing across containers and nodes, the management and autoscaling of the compute resources allocated to each stage, the movement of data between stages, and the recovery from hardware and application failures (by redoing lost work).
The compute engine treats the training and inference code as just another task. For example, Ray might schedule vLLM tasks to do batch LLM inference as part of a broader data processing pipeline.
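To sketch what that might look like, here is one hypothetical way to express the video-moderation pipeline with Ray Data. The stage bodies, model names, paths, and resource numbers are placeholders, and the exact Ray Data arguments vary across Ray versions.

```python
# Sketch (not a full implementation) of the video-moderation pipeline expressed with
# Ray Data: Ray schedules each stage across the cluster, manages the actor pools, and
# moves data between stages. Stage bodies, model names, and paths are placeholders.
import ray

def decode_videos(batch):
    # CPU stage: decode and standardize frames (e.g., with PyAV/ffmpeg).
    batch["frames"] = batch["bytes"]                     # placeholder for real decoding
    return batch

class VisionModels:
    # GPU stage: PyTorch computer-vision / audio models; each actor holds the weights.
    def __init__(self):
        import torch
        self.model = torch.nn.Identity().cuda()          # placeholder for real CV models
    def __call__(self, batch):
        batch["labels"] = ["placeholder-label"] * len(batch["frames"])
        return batch

class LLMReasoner:
    # GPU stage: LLM reasoning with vLLM over the per-video signals.
    def __init__(self):
        from vllm import LLM, SamplingParams
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder model
        self.params = SamplingParams(max_tokens=128)
    def __call__(self, batch):
        prompts = [f"Given these signals, is the video policy-violating? {l}"
                   for l in batch["labels"]]
        outputs = self.llm.generate(prompts, self.params)
        batch["decision"] = [o.outputs[0].text for o in outputs]
        return batch

(
    ray.data.read_binary_files("s3://videos/")                # placeholder input bucket
    .map_batches(decode_videos)                               # CPU tasks, scaled out by Ray
    .map_batches(VisionModels, num_gpus=1, concurrency=4)     # pool of GPU actors
    .map_batches(LLMReasoner, num_gpus=1, concurrency=2)      # pool of GPU actors
    .write_parquet("s3://moderation-results/")                # placeholder output location
)
```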
Ray and Spark are two of the most popular distributed compute engines. Ray is Python-native and GPU-aware, ideal for model training, model serving, and multimodal data processing. Spark is a unified analytics engine whose sweet spot is large-scale SQL and dataframe operations on CPUs.
It’s worth noting that failure handling and autoscaling responsibilities are split with the container orchestrator layer.
Failure handling: The distributed compute engine is workload aware and is responsible for ensuring that the overall AI workload catches failures and continues executing smoothly (possibly restarting dead processes, re-executing interrupted tasks, or reconfiguring itself on the remaining available compute). The container orchestrator is not workload aware. It is responsible for removing the problematic GPU (or other component) from the underlying cluster.
Autoscaling: The distributed compute engine is responsible for workload-aware autoscaling. It ensures that the AI workload requests the resources it needs (from the container orchestrator), releases resources when they are no longer needed, and reconfigures itself to run on the compute it has available. The container orchestrator is responsible for actually obtaining the required compute resources from the underlying cloud provider, allocating them to different AI workloads, and possibly taking them back when appropriate.
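As a hedged illustration of the workload-aware half of this split, Ray exposes per-task and per-actor fault-tolerance settings. The function bodies and resource numbers below are hypothetical.

```python
# Sketch of workload-aware failure handling at the compute-engine layer: Ray retries
# failed tasks and restarts dead actors on healthy hardware, while the container
# orchestrator is left to remove the bad node and provision a replacement.
import ray

@ray.remote(num_gpus=1, max_retries=3, retry_exceptions=True)
def embed_chunk(chunk):
    # If the GPU or node dies mid-task, Ray re-executes the task elsewhere.
    return chunk  # placeholder for real model inference

@ray.remote(num_gpus=1, max_restarts=2, max_task_retries=2)
class EmbeddingWorker:
    # If this process crashes, Ray restarts the actor and retries its pending calls.
    def embed(self, chunk):
        return chunk  # placeholder for real model inference
```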
Running example: In the video processing example, a container orchestrator like Kubernetes obtains compute resources and allocates them to the data processing workload, mediates competing resource demands from multiple concurrent jobs, and creates and manages the containers that run the workload.
Workloads do not exist in isolation. In a typical company, many users may run many jobs, and these jobs share a common pool of compute resources.
The container orchestration layer is responsible for scheduling workloads and multiplexing users across a shared pool of compute. The scheduling granularity is important: the container orchestrator schedules entire jobs, while the distributed compute engine schedules the tasks and processes that make up an individual job. This layer is responsible for creating containers and managing container lifecycles. Furthermore, it is responsible for interfacing with cloud providers to select instance types and provision compute.
Kubernetes is the dominant framework at this layer, with SLURM as a popular alternative, especially among researchers; SLURM’s strength is offline workloads (training and batch processing). Kubernetes handles both offline and online workloads, but it was originally created for microservices and therefore biases more toward online workloads. That said, while many teams start off using SLURM, the momentum is toward standardizing on Kubernetes.
One subtle point about this layer is the target users. Kubernetes fundamentally targets platform engineers: the people who operate clusters, configure infrastructure through YAML, manage multi-tenancy and resource sharing, and standardize policies across their organization. Ray, PyTorch, and vLLM target ML developers: people who write Python, use PyTorch and Hugging Face, and may find Docker and Kubernetes intimidating, yet have no problem debugging complex numerical bugs or performance bottlenecks in their inference pipelines. Individually, these layers do not satisfy the needs of both sets of users, but together, they can meet both sets of requirements.
A few companies have written extensively about their AI compute tech stacks.
Pinterest wrote an impressive three-part blog series covering its AI infrastructure.
Part 1 is on last-mile data processing for training and covers developer-velocity bottlenecks, especially around ML dataset iteration.
Part 2 is on managing Ray at Pinterest and covers the year-long journey to adopt Ray best practices and manage Ray workloads in production. It covers challenges ranging from the interaction with Kubernetes to authentication, logging, metrics, and cost optimization.
Part 3 is on batch inference and describes how they evolved their previous generation batch inference solution by pairing Ray with vLLM and PyTorch.
Their tech stack perfectly exhibits the Kubernetes + Ray + PyTorch + vLLM combination.
Workload | Container orchestrator | Distributed compute engine | Training & inference framework | Outcomes |
---|---|---|---|---|
Last-mile data processing + model training | Kubernetes-based PinCompute platform | Previously Spark, now Ray | PyTorch | Dataset-iteration wall-clock cut by 6x (from 90h to 15h). Dev cycles cut from days to hours. GPU utilization over 90%. Training throughput up 45% while per-job cost down 25%. ✧ (blog) |
Offline / batch inference | Kubernetes-based PinCompute platform | Previously Spark, now Ray | PyTorch, vLLM | 30x decrease in cost for search-quality jobs. 4.5x throughput increase. 4x reduction in job runtime for GPU inference jobs. Running around 1800 jobs per month. ✧ (blog, blog) |
Large-scale training (especially recommender models) | Kubernetes-based PinCompute platform | Ray | PyTorch | Running 5000+ training jobs per month. Transparent scaling. Greater developer velocity and interactive development. Heterogeneous resources to keep GPUs saturated. ✧ (blog, talk) |
Fig 2. Components of Pinterest’s software stack for managing AI compute.
Uber has been highly influential in the ML platform space and has written extensively about their AI infrastructure journey.
In 2017 they announced Michelangelo, a pioneering ML platform, detailing their stack from managing data to training models to deploying models to monitoring predictions.
Around the same time, they open sourced Horovod, an early distributed training framework. They followed that post up with additional details on scaling both Horovod with Ray for deep learning and XGBoost with Ray for classical ML.
When it comes to batch inference, they wrote about their PySpark-based architecture and spoke recently about an architecture based on Ray + vLLM.
In a separate blog post, they describe the steps they took to optimize the cost and reliability of their training and serving infrastructure across both on-prem and cloud settings.
In 2024, they shared an extensive history of the evolution of Michelangelo and how it made the transition from predictive ML to generative AI.
They describe their LLM training tech stack, which is based on PyTorch, Ray Train, Hugging Face Transformers, and DeepSpeed.
In 2024, Uber migrated its machine learning workloads to Kubernetes and recently described that migration in two parts. Part 1 describes the open source tech stack they adopted in order to enable this migration and the challenges they had to solve along the way. Part 2 zooms in on their job management platform on top of Kubernetes and some of the enhancements they had to make to Kubernetes.
Uber has evolved its ML platform significantly and has used many different open source technologies as part of its stack. The Kubernetes + Ray + PyTorch + vLLM combination plays a major role in their tech stack today, and it is surrounded by many other tools. Notably, Uber makes heavy use of Spark.
Workload | Container orchestrator | Distributed compute engine | Training & inference framework | Outcomes |
---|---|---|---|---|
LLM training & evaluation (fine-tuning 7B to 70B models on A100s & H100s; model evals) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Ray | PyTorch DDP, DeepSpeed, Hugging Face Transformers, vLLM for offline scoring | 2-7x larger batches, 2-3x throughput increase on Llama-2 70B. ✧ (blog, blog) |
Deep-learning & classical ML training (distributed training, hyperparameter optimization) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Originally Spark, now Ray | Horovod, XGBoost, PyTorch | |
Batch inference (classical models and LLMs) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Spark, Ray | TensorFlow, PyTorch, XGBoost, vLLM, Triton for embeddings | Scalability and GPU support. Multi-GPU inference. ✧ (blog, talk) |
Model serving (latency sensitive) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Michelangelo online prediction service | TensorFlow, PyTorch, previously served with Neuropod, now Triton | Framework agnostic, support for low-latency GPU serving. ✧ (blog) |
Marketplace-incentive optimization (adjusting incentives & discounts across thousands of cities) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Originally Spark, switched to Spark + Ray hybrid | CVXOPT and pure Python logic for optimization | 40x overall speed-up. Reduced job deployment from 15–20 min to 2 min. Improved iteration speed. ✧ (blog) |
Fig 3. Components of Uber’s software stack for managing AI compute.
Roblox has also written about the evolution of its ML platform, which they describe in three phases.
Phase 1 centered on Kubeflow and Spark for data processing and training workloads and on KServe and Triton for online serving.
Phase 2 focused on scaling training to support larger datasets and larger models as well as optimizing performance and efficiency for training and inference. This phase introduced Ray and Flink.
Phase 3 tackles LLM operations and brings in vLLM as a central piece of the platform.
Their tech stack features all of the core components described above along with Spark, Flink, Kubeflow, KServe, Triton, and many other tools.
Workload | Container orchestrator | Distributed compute engine | Training & inference framework | Outcomes |
---|---|---|---|---|
Batch / offline inference (personalization, multimodal search, CLIP embeddings, LLM batch inference) | Kubernetes using hybrid cloud, Yunikorn for queueing | Previously Spark or reusing Triton online inference services, now Ray | PyTorch, vLLM | 9x speedup over online LLM server, 58% cost reduction for large image CLIP inference. Enabled multistage processing, heterogeneous resources, 10x improved GPU utilization, greater fault tolerance, 1B requests per day. ✧ (blog, talk) |
Online LLM inference (assistant, chat translation, voice safety) | Kubernetes using hybrid cloud | KServe | vLLM is the primary LLM inference engine, Triton | Switch to vLLM delivered 2x lower latency and higher throughput. ✧ (blog) |
Training pipelines (daily retraining, distributed training) | Kubernetes using hybrid cloud, Yunikorn for queueing | Kubeflow Pipelines, Ray for distributed training | PyTorch | Platform expanded from 50 → 250 inference pipelines. ✧ (blog) |
Fig 4. Components of Roblox’s software stack for managing AI compute.
This last case study is not about a specific company, but instead illustrates the stack through an important emerging AI workload, namely post-training.
Post-training now centers on reinforcement learning and is notable due to its complexity. Compared with regular training, post-training involves a combination of model training as well as model inference. Inference is run in order to generate additional data for training. Model inference often interacts with simulation or an agentic environment (e.g., if you are teaching the model to do software engineering, you may be running and scaling containerized coding environments). Reward models score the quality of generated data. Model weights need to be shipped from the training portion of the algorithm to the inference portion, and rollouts (essentially execution traces) need to be shipped from the generation part back to the training part. As a result, there are many moving parts, and the distributed systems challenges are harder.
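To make those moving parts concrete, here is a highly simplified sketch, not taken from any of the frameworks below, of how the pieces might be wired together with Ray actors. The model name, reward function, and prompts are placeholders; real frameworks add careful placement, batching, and weight-transfer optimizations.

```python
# Simplified sketch of an RL post-training loop: a vLLM-based generator produces
# rollouts, a reward function scores them, a trainer updates the policy, and the
# updated weights are shipped back to the inference engine. Everything here is a
# placeholder; real frameworks (VeRL, OpenRLHF, NeMo-RL, ...) are far more involved.
import ray

@ray.remote(num_gpus=1)
class RolloutGenerator:
    def __init__(self):
        from vllm import LLM, SamplingParams
        self.llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")      # placeholder model
        self.params = SamplingParams(max_tokens=256)
    def generate(self, prompts):
        return [o.outputs[0].text for o in self.llm.generate(prompts, self.params)]
    def load_weights(self, state_dict):
        ...  # push updated policy weights into the inference engine

@ray.remote(num_gpus=1)
class Trainer:
    def __init__(self):
        ...  # build the policy with PyTorch FSDP or DeepSpeed
    def train_step(self, rollouts, rewards):
        ...  # PPO / GRPO update
    def get_weights(self):
        ...  # return the current policy state dict

def score(rollouts):
    # Placeholder reward: real systems use learned reward models or verifiable rewards.
    return [float(len(r)) for r in rollouts]

generator = RolloutGenerator.remote()
trainer = Trainer.remote()
prompts = ["Fix the failing unit test in ...", "Refactor this function to ..."]  # toy batch

for step in range(3):                                          # toy outer loop
    rollouts = ray.get(generator.generate.remote(prompts))     # inference generates data
    rewards = score(rollouts)                                  # reward scoring
    ray.get(trainer.train_step.remote(rollouts, rewards))      # training consumes rollouts
    weights = trainer.get_weights.remote()
    ray.get(generator.load_weights.remote(weights))            # ship weights back to inference
```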
The table below describes the software stacks used to implement five of the most popular open source post-training frameworks. These frameworks are generally implemented by combining technologies at the distributed compute engine layer (primarily Ray) with technologies at the training and inference framework layer. They are designed to run across multiple container orchestrators (Kubernetes and SLURM). SLURM is especially popular for deploying these frameworks because researchers are the main people doing post-training right now.
Note that because this workload requires both training and inference, we separate out training frameworks from inference frameworks in this table.
Framework | Container orchestrator | Distributed compute engine | Training framework | Inference framework | Other comments |
---|---|---|---|---|---|
VeRL (ByteDance) | Documentation primarily for SLURM | Ray for scheduling and coordination of RL components | PyTorch FSDP, Megatron-LM (for very large models) | vLLM, SGLang | PPO, GRPO. Decouples data flow and compute. ✧ (GitHub) |
SkyRL (UC Berkeley Sky Lab) | Experiments run on Kubernetes | Ray for scheduling and coordination of RL components | PyTorch FSDP | vLLM, SGLang | Extends VeRL with agentic capabilities for tasks like SWE-Bench. ✧ (GitHub, blog) |
OpenRLHF (OpenRLHF community) | Scripts provided for SLURM | Ray for scheduling and coordination of RL components | DeepSpeed | vLLM | |
Open-Instruct (AllenAI) | Experiments run on Kubernetes-based Beaker platform | Ray for scheduling and coordination of RL components | PyTorch, DeepSpeed | vLLM | Fine-tuning, DPO, preference tuning, RLVR. ✧ (GitHub) |
NeMo-RL (NVIDIA) | Documentation primarily for SLURM, but also Kubernetes | Ray for scheduling and coordination of RL components | PyTorch FSDP, Megatron-LM (for very large models) | vLLM | Scalable post-training, up to 1000s of GPUs, 100B+ parameter models. ✧ (GitHub) |
Fig 5. Technologies used for building the most popular open source post-training frameworks. Ray, PyTorch, and vLLM are used universally in their implementations. These frameworks are typically deployed on top of Kubernetes or SLURM.
¹ Beyond the examples given in this blog post, others include Spotify, ByteDance, Tencent, Canva, Coinbase, Instacart, Niantic, Netflix, Runway, Reddit, Cohere, Ant Group, Samsara, eBay, Handshake, Workday, Zoox, Airbnb, Nubank, Attentive, and Apple.