Trends in AI and Python Scalability: Reflections from the Ray Summit Program Chairs

By Dean Wampler and Ben Lorica   

Introduction

When we envisioned the program for the upcoming Ray Summit, we looked at trends in computing that we wanted to highlight. First, there’s been steady progress towards simplification, efficiency, and lower costs: think of cloud computing, microservices, serverless, and cloud native infrastructure.

Second, we were also cognizant of the growing importance of machine learning, particularly of deep learning and reinforcement learning, techniques that bring a host of challenges for developers. Thus we chose the following theme for the conference:

Scalable machine learning, scalable Python, for everyone.

Ray Summit

As we looked at the broader computing landscape, we noted a few key challenges that we highlight in this post. The reality is that developers and data scientists would rather not talk about servers, cloud instances, or distributed computing. In the ideal scenario, developers and data scientists focus on their code while their IT infrastructure automatically takes care of their computing needs, including scalability to clusters.

Ray is well-positioned to make this desire a reality. The first Ray Summit will showcase how many leading companies and open source communities are using Ray to overcome their scalability and distribution challenges. We have speakers from leading open source libraries, and presentations from companies in financial services, retail and e-commerce, manufacturing, media and advertising, software, and more.

But the conference isn’t just about Ray. It also features leading researchers and practitioners in machine learning and cloud-native services who are developing many of the practices and tools that all of us will be using in the near future.

The need for scale and distributed computing

In a post from earlier this year, Anyscale co-founder and UC Berkeley professor Ion Stoica made a strong case for why companies will need access to tools for distributed computing. He cited training times for large neural models – including language models – as one of the main reasons why architects have little choice but to adopt tools for distributing computations over a cluster. In particular, at this year’s Ray Summit, we have presentations from creators of some of the leading open source NLP libraries (including spaCy, Hugging Face, Spark NLP). They will explain how they used Ray to meet some of their distributed computing needs.

But this need for distributed computing isn’t limited to NLP, of course. Companies increasingly need to scale model training and reinforcement learning across compute clusters. We have many presentations on scaling ML, including sessions on distributed RL from Autodesk, presentations on Ray Tune and Ray SGD (a new library that simplifies distributed training in PyTorch and TensorFlow), on real-world hyperparameter tuning, and on how Uber is using Ray in Horovod for distributed deep learning.

The need for fine-grained control and heterogeneous computing

Developers need to be able to optimize compute resources based on the tasks and workloads appropriate for their domains, not the constraints imposed by distributed computing frameworks.  In some situations – like training large neural models – they may need accelerators (GPUs, TPUs, ASICs). In other cases, including data processing, streaming, simulation, and machine learning, they may need to use many CPU cores.

Ray allows developers to describe the hardware resources they need (number of CPUs, GPUs, TPUs, and other hardware accelerators) globally or on a per-task basis. Future versions of Ray will even allow developers to specify the precise type of chip they prefer (e.g., “two V100 GPUs”). There will be many sessions that highlight how users of Ray improved performance and utilization using fine-grained control of hardware resources.

Access to intuitive distributed state management

Many applications, including streaming, data processing, simulation, and machine learning, benefit from distributed state management. One can manage state using a database, and many developers go this route, but significant performance penalties and cost considerations point to the need for alternative solutions. At the same time, developers would rather not hand-code distributed state management themselves.

Ray “tasks”, for stateless computing, and “actors”, for stateful computing, are natural extensions of the familiar concepts of functions and classes in Python, and they can encapsulate arbitrary compute activity and application state. Developers write code with the intuitive Ray API and Ray handles distribution over a cluster. No radically different tools are required for distributed computing versus managing distributed state.

There are several speakers who will describe how they used Ray to build stateful applications. This includes a presentation on a distributed online learning system from Oak Ridge National Laboratory. This particular team used Ray to build models used in real-time suicide prevention algorithms, fraud detection, and infectious disease surveillance.

Converging towards MLOps

MLOps (machine learning operations) has emerged as a specialty within the larger DevOps ecosystem because of the unique requirements for deploying and managing data science artifacts, especially models, in production.

Ant Financial, which currently runs the world’s largest Ray clusters, has several talks (here and here) at Ray Summit on their practical experience building and running ML-based applications. Weights & Biases will describe their experiences running and scaling hyperparameter tuning. Intel will discuss their extensive AutoML capabilities in Analytics Zoo. Seldon will discuss distributed black-box explanation for very large data sets. You can also hear how Primer built production NLP pipelines.

Reinforcement Learning Continues to Spread

Finally, RL has been hot ever since it was used to achieve expert-level play in Atari games, to beat the world’s best Go players, and to power many autonomous vehicle and robotics applications. Now it is spreading to other industries and applications.

AWS will discuss using RL for supply chain management and recommender systems in SageMaker. A forthcoming Anyscale blog post will also highlight AWS’s use of multi-armed bandits for various systems. Pathmind will discuss applying RL to improve business process simulation. Similarly, Dow Chemical will discuss applying RL to industrial production scheduling. Both Microsoft and Autodesk will discuss RL applications in autonomous systems and robotics.

Summary

We curated Ray Summit to highlight important trends in ML/AI and distributed computing, with keynotes and sessions from experts in these areas to provide their insights on where we are headed. You’ll hear from Ray users developing applications for diverse domains, some of which we learned about for the first time while curating the Ray Summit program. You’ll also hear what’s new with Ray itself. We hope you’ll join us and learn all this for yourself.

Ray Summit 2020! Join the livestream or watch sessions on-demand. Register Today!

Next steps

Anyscale's Platform in your Cloud

Get started today with Anyscale's self-service AI/ML platform:


  • Powerful, unified platform for all your AI jobs from training to inference and fine-tuning
  • Powered by Ray. Built by the Ray creators. Ray is the high-performance technology behind many of the most sophisticated AI projects in the world (OpenAI, Uber, Netflix, Spotify)
  • AI App building and experimentation without the Infra and Ops headaches
  • Multi-cloud and on-prem hybrid support