Welcome to our second Ray meetup, where we focus on Ray’s native libraries for scaling machine learning workloads.
We'll discuss Ray Train, a production-ready distributed training library for deep learning workloads, and present the TorchX and Ray integration. Through this integration, PyTorch developers can submit PyTorch-based scripts and workloads to a Ray cluster using TorchX’s SDK and CLI via its new Ray scheduler.
6:00 PM Welcome remarks, announcements & agenda by Jules Damji, Anyscale, Inc.
6:05 PM “Ray Train: For production-ready distributed deep learning” by Will Drevo, Amog Kamsetty & Matthew Deng, Anyscale, Inc.
6:40 PM Q&A
6:50 PM “Large scale distributed training with TorchX and Ray” by Mark Saroufim, Meta AI
7:30 PM Q&A
Today, most frameworks for deep learning prototyping, training, and distributing to a cluster are either powerful and inflexible or nimble and toy-like. Data scientists are forced to choose between a great developer experience and a production-ready framework.
To close this gap, the Ray ML team has developed Ray Train.
Ray Train is a library built on top of the Ray ecosystem that simplifies distributed deep learning. Currently in stable beta in Ray 1.9, Ray Train offers the following features:
Scales to multi-GPU and multi-node training with zero code changes
Runs seamlessly on any cloud (AWS, GCP, Azure, Kubernetes, or on-prem)
Supports PyTorch, TensorFlow, and Horovod
Distributed data shuffling and loading with Ray Datasets
Distributed hyperparameter tuning with Ray Tune
Built-in loggers for TensorBoard and MLflow
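As a rough illustration of the "zero code changes" idea in the list above, here is a minimal sketch assuming the Ray 1.9-era Trainer interface (ray.train.Trainer with start, run, and shutdown); later Ray releases changed this API. The training function, its toy data, and the run_distributed helper are hypothetical names made up for this example, and the "torch" backend assumes PyTorch is installed on the cluster.

```python
def train_func(config):
    """Toy training loop: fit w in y = w * x with plain gradient descent.

    Hypothetical stand-in for a real PyTorch/TensorFlow training function.
    """
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
    w = 0.0
    for _ in range(config["epochs"]):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= config["lr"] * grad
    return w


def run_distributed(num_workers=2, use_gpu=False):
    """Run the same function on several workers (assumes Ray 1.9 is installed)."""
    # Imported lazily so train_func can still be run locally without Ray.
    from ray.train import Trainer

    trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=use_gpu)
    trainer.start()
    # Each worker executes train_func; run returns one result per worker.
    results = trainer.run(train_func, config={"lr": 0.05, "epochs": 100})
    trainer.shutdown()
    return results


if __name__ == "__main__":
    # Local, single-process run; the function body is the same one Ray would execute.
    print(train_func({"lr": 0.05, "epochs": 100}))
```

Under these assumptions, moving from one process to several workers (or to GPUs) means changing only the num_workers and use_gpu arguments in run_distributed, while train_func stays untouched.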
In this session, we'll talk through some of the challenges in large-scale computer vision ML training, and show a demo of Ray Train in action.
Bios: Matthew Deng & Amog Kamsetty are senior software engineers at Anyscale on the Ray ML team. Will Drevo is an ML product manager at Anyscale.
Large-scale model training has generally been out of reach for people in open source because it requires an engineer to learn how to set up infrastructure, how to build composable software systems, and how to write robust machine learning scripts.
To that end, we’ve built the TorchX Ray scheduler, which leverages the newly created Ray Job API so that scientists can focus on writing their scripts while infrastructure and systems setup stays relatively easy.
1. Setting up a multi-GPU cluster on any cloud provider is as easy as running ray up cluster.yaml
2. TorchX embraces a component-based approach to designing systems that makes your ops workflows composable
3. Running a distributed PyTorch script is then as simple as calling torchx run
In this session, we’ll go through a practical live demo of how to train multi-GPU models, set up the infrastructure live, and share some tips and best practices for productionizing such workflows.
Bio: Mark Saroufim is a senior software engineer with the Meta AI and PyTorch Engineering Group and works on the PyTorch ecosystem.
Will is a Product Manager for ML at Anyscale. Previously, he was the first ML Engineer at Coinbase, and ran a couple of ML-related startups, one in the data labeling space and the other in the pharmaceutical space. He has a BS in CS and Music Composition from MIT, and did his master's thesis at MIT in machine learning systems. In his spare time, he produces electronic music, travels, and tries to find the best Ethiopian food in the Bay Area.
Amog Kamsetty is a software engineer at Anyscale where he works on building distributed training libraries and integrations on top of Ray. He previously completed his MS degree at UC Berkeley working with Ion Stoica on machine learning for database systems.
Matthew Deng is a software engineer at Anyscale where he works on distributed machine learning libraries built on top of Ray. Before that, he was a software engineer at LinkedIn. He holds a BS in Electrical Engineering and Computer Science from UC Berkeley.