This blog post is part of the Ray Summit 2023 highlights series, where we summarize the most exciting talks from our recent LLM developer conference.
Disclaimer: This summary was AI-generated from the video transcript, with human edits.
Netflix's machine learning platform utilizes heterogeneous training clusters to power its recommendation and content personalization systems. In this blog post, we delve into the details of how Netflix leverages Ray and GPU clusters for efficient model training, communication, and data management. We'll also explore the challenges they face and their strategies for resource allocation and GPU utilization.
Netflix uses personalized recommendations on its homepage to help users find relevant content. This involves ranking algorithms, customized artwork, etc.
They use various ML models for recommendations and computer vision tasks. Workloads include recommendation, media understanding, and large language models.
To optimize computation, they use custom operators, integrate latest ops from ML frameworks, and fuse sequences of ops.
For communication, they use NCCL for dense gradients but need custom solutions for sparse embedding/table gradients.
For I/O, they use local SSDs, S3 streaming, and FSx caching from S3 to get high throughput to GPUs.
They offload data loading to remote CPUs with TF data service and Ray to decouple it from GPU training.
They use durable heterogeneous clusters per team with autoscaling. Jobs specify only the number of GPUs needed.
They store data in S3, sync it to FSx for high-speed training access, and write logs and checkpoints to EFS.
They are working on a centralized scheduler, exploring batch inference, and moving to fully scheduled job submission.
Netflix, a global streaming giant, is known for its data-driven approach to content personalization. To achieve this, they employ a robust machine learning platform that relies heavily on GPU clusters for model training and recommendation systems. In this blog post, we'll explore heterogeneous training clusters at Netflix, focusing on their use of Ray and GPUs.
Heterogeneous training clusters are an essential component of Netflix's machine learning infrastructure. These clusters bring together different types of hardware, including both CPUs and GPUs, to address a range of machine learning workloads. By doing so, Netflix can efficiently support a variety of use cases, from recommendation systems to large language models.
At the heart of Netflix's success is its ability to personalize content recommendations for its users. The platform's homepage is highly personalized for each profile, and everything on the homepage revolves around recommendations. These recommendations are generated through complex machine learning models that analyze a user's viewing history and preferences.
Netflix's machine learning workloads can be categorized into three main types:
1. Recommendation Use Case: This involves models that concatenate dense features with sparse (embedding) features and feed them through MLP layers. It's characterized by large datasets and distributed training across multiple GPUs.
2. Multimodal Use Case: Netflix's researchers work with diverse data sources, including text, images, audio, video, and more. These datasets can be extremely large, leading to the need for distributed data parallelism.
3. Large Language Models: This emerging use case involves training massive language models, often on comparatively small datasets. It is computationally intensive due to its Transformer-based architectures.
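The recommendation architecture described in the first use case — dense features concatenated with pooled sparse embeddings, then an MLP — can be sketched in plain Python. This is a toy illustration of the general pattern, not Netflix's actual model; all dimensions and weights are made up:

```python
import random

random.seed(0)

EMBED_DIM = 4

# Toy embedding table for a sparse categorical feature (e.g. an item ID).
embedding_table = {item: [random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
                   for item in range(100)}

# Toy MLP weights: 2 dense features + pooled embedding -> hidden -> score.
IN_DIM = 2 + EMBED_DIM
w1 = [[random.uniform(-0.5, 0.5) for _ in range(IN_DIM)] for _ in range(8)]
b1 = [0.0] * 8
w2 = [[random.uniform(-0.5, 0.5) for _ in range(8)]]
b2 = [0.0]

def mlp_layer(x, weights, bias):
    """One dense layer with a ReLU activation."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def forward(dense_features, sparse_ids):
    # Look up and mean-pool the embeddings for the sparse IDs.
    emb = [sum(embedding_table[v][i] for v in sparse_ids) / len(sparse_ids)
           for i in range(EMBED_DIM)]
    # Concatenate dense features with the pooled embedding, then apply the MLP.
    x = dense_features + emb
    hidden = mlp_layer(x, w1, b1)
    return mlp_layer(hidden, w2, b2)[0]

score = forward([0.7, 0.1], sparse_ids=[3, 42, 7])
```

In production, the embedding tables are the part that grows enormous (millions of rows), which is why their gradients need the sparse communication treatment discussed below.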
To make the most of GPU clusters, Netflix employs a range of strategies:
1. Custom Operators: They use custom operators to improve computation efficiency, reducing the overhead associated with composing complex operations.
2. State-of-the-Art Operators: Netflix leverages state-of-the-art operators from TensorFlow and PyTorch to enhance memory usage and overall model performance.
3. GPU Communication: Netflix optimizes GPU communication, with considerations for different model sizes and batch sizes. They rely on NCCL for dense gradients and explore custom solutions for sparse weights such as embedding tables.
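The motivation for custom sparse communication is that a batch touches only a handful of rows in a huge embedding table, so all-reducing the full dense table would waste nearly all the bandwidth. The following pure-Python sketch illustrates the idea of exchanging only (row index, gradient) pairs; the worker data is hypothetical and this is a conceptual stand-in, not how NCCL or Netflix's actual implementation works:

```python
DIM = 4  # embedding dimension; the full table could have millions of rows

# Hypothetical per-worker sparse gradients: row index -> gradient vector.
# Each worker only has gradients for the rows its batch actually touched.
worker_grads = [
    {3: [0.1] * DIM, 17: [0.2] * DIM},           # worker 0 touched rows 3, 17
    {17: [0.05] * DIM, 999_999: [0.3] * DIM},    # worker 1 touched rows 17, 999999
]

def sparse_allreduce(grads_per_worker):
    """Sum sparse gradients by row index across workers — a toy stand-in
    for the custom collective that dense NCCL all-reduce doesn't provide."""
    reduced = {}
    for grads in grads_per_worker:
        for row, grad in grads.items():
            acc = reduced.setdefault(row, [0.0] * DIM)
            for i, v in enumerate(grad):
                acc[i] += v
    return reduced

reduced = sparse_allreduce(worker_grads)
# Only 3 rows are exchanged instead of the entire table.
```

The same index-keyed reduction is what sparse gradient formats (e.g. `tf.IndexedSlices`) make possible at scale.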
Effective data storage and management are crucial for machine learning workflows. Netflix employs a variety of strategies:
1. Local SSD Disk: They use local SSD disks for high-speed data access but acknowledge limitations in scalability.
2. S3 Streaming: Streaming directly from S3 provides scalable cloud storage but comes with throughput limitations.
3. FSx for Lustre: This high-throughput file system helps in reading large datasets efficiently and integrates with the GPU clusters.
4. Data Loading Optimization: To optimize data loading and alleviate CPU bottlenecks, they offload data loading to remote CPUs using Ray.
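The core pattern behind offloading data loading is a bounded producer–consumer pipeline: CPU workers prepare batches ahead of time so the GPU never waits on preprocessing. The stdlib sketch below uses a background thread as a stand-in for the remote CPU workers that Ray or the tf.data service would provide; the batch contents are fake:

```python
import queue
import threading

def data_loader(out_q, num_batches):
    # In Netflix's setup this work runs on remote CPU workers via Ray or the
    # tf.data service; here a local background thread stands in for them.
    for i in range(num_batches):
        batch = [x * 0.5 for x in range(i, i + 4)]  # stand-in preprocessing
        out_q.put(batch)
    out_q.put(None)  # sentinel: no more data

def train(num_batches=5):
    # A bounded queue decouples loading from training: the producer runs
    # ahead by up to 8 batches, and back-pressure kicks in beyond that.
    out_q = queue.Queue(maxsize=8)
    threading.Thread(target=data_loader, args=(out_q, num_batches),
                     daemon=True).start()
    steps = 0
    while (batch := out_q.get()) is not None:
        steps += 1  # the GPU training step would consume `batch` here
    return steps
```

With Ray, the producer side scales out to a pool of CPU-only nodes, so the ratio of preprocessing capacity to GPU capacity can be tuned independently.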
Netflix employs resource allocation and scheduling strategies to ensure efficient use of their heterogeneous training clusters. They use the concept of resource pools, where they group machines according to their capabilities. This simplifies resource allocation for researchers who need GPU resources.
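A resource pool of this kind can be expressed in a Ray autoscaler cluster config, where each node type declares the resources it offers and the autoscaler picks nodes to satisfy whatever a job requests. The fragment below is a hypothetical sketch — the pool name, instance types, and worker counts are illustrative, not Netflix's actual configuration:

```yaml
# Hypothetical heterogeneous resource pool for one team:
# CPU nodes for data loading, GPU nodes for training.
cluster_name: recsys-team-pool
max_workers: 20

available_node_types:
  head:
    node_config: {InstanceType: m5.2xlarge}
    resources: {CPU: 8}
    max_workers: 0
  cpu-loaders:
    node_config: {InstanceType: m5.8xlarge}
    resources: {CPU: 32}
    min_workers: 0
    max_workers: 10
  gpu-trainers:
    node_config: {InstanceType: p4d.24xlarge}
    resources: {CPU: 96, GPU: 8}
    min_workers: 0
    max_workers: 8

head_node_type: head
```

This matches the simplification described above: a researcher's job only declares how many GPUs it needs (e.g. `@ray.remote(num_gpus=1)` on a training task), and the autoscaler provisions matching nodes from the pool.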
Despite Netflix's sophisticated infrastructure, they acknowledge challenges, including the need to accommodate diverse workloads and explore emerging GPU technologies. They are also working on a centralized job scheduler and queue system to maximize resource utilization and reduce contention between teams.
Netflix's use of heterogeneous training clusters with Ray and GPU technology is a testament to the power of AI and ML in the entertainment industry. It underlines the significance of efficient data management, computation, and resource allocation in delivering a personalized streaming experience. As machine learning and GPU technologies continue to advance, it will be exciting to see how Netflix evolves its infrastructure to keep providing top-notch content recommendations to its millions of viewers worldwide.