Inference Graphs at LinkedIn Using Ray-Serve

By Anyscale Ray Team   

Update June 2024: Anyscale Endpoints (Anyscale's LLM API offering) and Private Endpoints (self-hosted LLMs) are now available as part of the Anyscale Platform.

This blog post is part of the Ray Summit 2023 highlights series, where we provide summaries of the most exciting talks from our recent LLM developer conference.

Disclaimer: This summary was AI-generated from the video transcript, with human edits.

Key Takeaways

LinkedIn's AI platform is undergoing a transformation, leveraging the power of Ray and Inference Graphs to simplify complex AI workflows. In this blog post, we delve into a recent talk by LinkedIn's Ali and Sasha, highlighting the challenges they faced, the solutions they've implemented, and how they're using Ray to streamline AI deployment. Discover how Ray is helping LinkedIn's AI platform achieve better integration, resource utilization, and seamless transitions between online and offline inference.

Inference Graphs at LinkedIn Using Ray-Serve

LinkedIn, the global professional network, has always been at the forefront of leveraging technology and AI to enhance the user experience. As the demand for AI services and tools continues to grow, LinkedIn is taking a bold step by optimizing its AI infrastructure to ensure faster and more cost-efficient development of AI and ML applications across both CPUs and GPUs. In this blog post, we'll explore how LinkedIn is revolutionizing its AI infrastructure with the help of Ray and Inference Graphs.

Challenges in AI Infrastructure

In a recent talk by Ali and Sasha from LinkedIn's AI platform team, they discussed the challenges they encountered and how they are addressing them. These challenges span a wide range of areas, from managing complex AI workflows to integrating multiple programming languages and handling diverse use cases.

One of the key challenges they highlighted was the need to adapt to the ever-evolving AI landscape. The custom stack they were using added friction when trying to adopt new trends and technologies quickly. LinkedIn needed a more agile infrastructure that could keep up with the rapidly changing AI field.

Another challenge was the transition from homogeneous to heterogeneous infrastructure. With GPUs becoming increasingly important, LinkedIn wanted to build an infrastructure that could seamlessly adapt to diverse GPU configurations, rather than replicating the same multi-tenancy setup multiple times.

Auto-scaling and resource utilization were also on their radar. They needed a solution that allowed them to scale resources efficiently based on demand while minimizing wastage. Additionally, they wanted to simplify the process of moving from offline to online inference, a task that typically required substantial code rewriting.

The Introduction of Ray and Inference Graphs

LinkedIn's solution to these challenges involved the adoption of Ray and the concept of Inference Graphs. Ray, an open-source distributed computing framework, is a powerful tool for building and deploying AI applications. It's designed to make it easier to scale AI workloads efficiently across multiple CPUs and GPUs.

In the context of LinkedIn's AI platform, Inference Graphs refer to the graphical representation of AI workflows. These graphs encapsulate the entire process, from data preprocessing to model inference and post-processing steps. With this approach, LinkedIn aims to simplify AI workflows, allowing AI engineers to focus on high-level authoring and deployment without getting bogged down in low-level details.
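
To make the shape of such a graph concrete, here is a minimal sketch using Ray Serve's model composition API. The deployment names and the toy logic are illustrative assumptions, not LinkedIn's actual code:

```python
from starlette.requests import Request

from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Preprocess:
    def run(self, text: str) -> str:
        # CPU-bound cleanup standing in for real preprocessing.
        return text.lower().strip()


@serve.deployment(ray_actor_options={"num_gpus": 1})
class Model:
    def predict(self, text: str) -> float:
        # Toy stand-in for GPU-based model inference.
        return float(len(text))


@serve.deployment
class InferenceGraph:
    """Ingress node that wires preprocessing into model inference."""

    def __init__(self, pre: DeploymentHandle, model: DeploymentHandle):
        self.pre = pre
        self.model = model

    async def __call__(self, request: Request) -> float:
        payload = await request.json()
        cleaned = await self.pre.run.remote(payload["text"])
        return await self.model.predict.remote(cleaned)


app = InferenceGraph.bind(Preprocess.bind(), Model.bind())
serve.run(app)  # serves the whole graph as one application
```

Each decorated class becomes an independently scalable deployment, and bind composes them into a single application that Serve runs as one graph.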

The Ray-Based Inference Runtime

LinkedIn's Ray-based Inference Runtime provides a heterogeneous infrastructure where each inference graph operates as an independent unit. Each graph can be replicated differently to meet its specific needs, giving fine-grained scale-out control for A/B experiments and other use cases.
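
As a hedged sketch of what that per-graph control can look like using Ray Serve's public options (the variant names and the toy scorer are assumptions, not LinkedIn's configuration):

```python
from ray import serve


@serve.deployment
class Scorer:
    def __call__(self, features: dict) -> float:
        return sum(features.values())  # toy scoring logic


# Fixed replication for the control variant of a graph...
control = Scorer.options(name="scorer_control", num_replicas=4)

# ...and demand-driven autoscaling for the treatment variant.
treatment = Scorer.options(
    name="scorer_treatment",
    autoscaling_config={"min_replicas": 1, "max_replicas": 8},
)
```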

To enable seamless transition between offline and online inference, LinkedIn has integrated Ray into its experimentation framework. This integration ensures that AI engineers can roll out their models to production without significant hurdles.

A workload is split across multiple graphs, and each graph can be further divided into multiple nodes. Each node is typed by whether its work is CPU- or GPU-bound, which lets LinkedIn isolate data-intensive steps on CPUs while offloading inference tasks to GPUs for optimal resource utilization.

LinkedIn has also integrated this infrastructure with load balancing and other essential components to ensure a robust AI deployment.

High-Level Authoring of Inference Graphs

To simplify the authoring of Inference Graphs, LinkedIn has developed an internal platform called Proxima. This platform empowers AI engineers to independently author, deploy, and run their inference graphs. It's tightly integrated with the CI/CD pipelines, experimentation systems, and feature infrastructure, enabling seamless collaboration and deployment.

The Proxima platform utilizes Python for intuitive graph configuration. AI engineers can create graphs by combining objects representing models, pre-processing, post-processing, and feature fetching steps. This high-level authoring approach allows AI engineers to work efficiently without delving into the intricacies of low-level implementation.
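
Proxima is internal to LinkedIn and its API has not been published, so the following is a purely hypothetical sketch of what that object-based graph configuration could look like. Every class and name below is invented to illustrate the authoring style described in the talk:

```python
from dataclasses import dataclass, field


# Hypothetical building blocks, not Proxima's real API.
@dataclass
class FeatureFetch:
    source: str

@dataclass
class Preprocess:
    fn: str

@dataclass
class Model:
    artifact: str
    device: str = "gpu"

@dataclass
class Postprocess:
    fn: str
    params: dict = field(default_factory=dict)

@dataclass
class Graph:
    name: str
    steps: list

    def deploy(self) -> None:
        # In the real platform this would hand the graph to the Ray-based
        # inference runtime via CI/CD; here it just echoes the plan.
        for step in self.steps:
            print(step)


graph = Graph(
    name="job-recommendations",
    steps=[
        FeatureFetch(source="member-features"),
        Preprocess(fn="normalize_features"),
        Model(artifact="ranker-v12", device="gpu"),
        Postprocess(fn="top_k", params={"k": 10}),
    ],
)
graph.deploy()
```

The point of this style is that an AI engineer composes named steps rather than writing serving code; the platform decides how each step maps onto CPU or GPU nodes at deployment time.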

The Impact on AI Workloads

With the introduction of Ray and Inference Graphs, LinkedIn has seen significant changes in the way it handles AI workloads. For LLM (large language model) use cases, the overhead of the new infrastructure is minimal, often measured in milliseconds; these workloads are dominated by the GPU-based model inference itself, and the infrastructure does not add substantial overhead on top of it.

In the case of personalization models, where latency is crucial, LinkedIn has optimized its infrastructure to minimize serialization and deserialization costs. This optimization is achieved by reducing the Python components in the call path and introducing Java-based runtimes. The difference between Python and Java in terms of performance is substantial, particularly when it comes to feature fetching.

LinkedIn is actively exploring solutions to further reduce the overhead of serialization and deserialization. The goal is to maximize the efficiency of AI workloads, even for models that require high performance.

Online and Offline Inference

One of the remarkable achievements of LinkedIn's Ray-based infrastructure is the ability to use the same inference graph for both online and offline inference. While the scaling configurations differ between the two, the inference graph remains consistent. This means that AI engineers can create a single graph and run it in both environments, significantly simplifying the development and deployment process.

For offline inference, LinkedIn leverages Ray Data to scale out the workload efficiently. This approach allows them to handle large-scale batch processing with ease. By co-locating actors and optimizing the placement of objects, they aim to minimize the cost of transmitting data over the network.
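
As a rough sketch of what the offline path can look like with Ray Data (the paths, column names, and scoring logic below are assumptions, not LinkedIn's pipeline):

```python
import ray


class BatchScorer:
    """Stateful scorer: the model is loaded once per actor, not once per batch."""

    def __init__(self):
        self.scale = 0.5  # stand-in for loading a real model artifact

    def __call__(self, batch: dict) -> dict:
        # Batches arrive as dicts of NumPy arrays by default.
        batch["score"] = batch["feature"] * self.scale
        return batch


ds = ray.data.read_parquet("s3://example-bucket/member-features/")  # illustrative path
scored = ds.map_batches(
    BatchScorer,
    batch_size=256,
    concurrency=4,  # a pool of four actors shares the work
    num_gpus=1,     # give each actor a GPU when the model needs one
)
scored.write_parquet("s3://example-bucket/scores/")
```

Because the scoring step is the same logical node as in the online graph, only the scaling configuration (batch size, actor pool, data source) changes between the two environments.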

The Future of LinkedIn's AI Infrastructure

LinkedIn's journey to optimize its AI infrastructure with Ray and Inference Graphs is ongoing. They are actively working on addressing the challenges and improving the performance of their AI workloads, including exploring Java-based and native C++ runtimes and fine-tuning the co-location of objects to further reduce serialization costs.

The platform they've built is not only streamlining the deployment of AI models but also making it easier to integrate AI into the existing LinkedIn ecosystem, where Java is predominant. It offers AI engineers the tools and high-level authoring capabilities they need to be productive without worrying about low-level implementation details.

Conclusion

LinkedIn's journey to revolutionize its AI infrastructure with Ray and Inference Graphs is an exciting development in the field of AI and ML. This transformation addresses the challenges faced by LinkedIn's AI platform, providing a more agile, cost-efficient, and performance-driven solution for AI engineers.

Ray's distributed computing framework has proven to be a valuable asset in this transformation, enabling LinkedIn to seamlessly scale AI workloads, optimize resource utilization, and simplify the transition between online and offline inference. As LinkedIn continues to fine-tune its infrastructure and explore further optimizations, we can expect to see even greater improvements in AI deployment efficiency.

LinkedIn's commitment to innovation and technology is evident in this transformation, and it serves as a compelling example of how organizations can leverage cutting-edge technologies to enhance their AI capabilities and ultimately provide better services to their users.

