Ray Summit 2022
At Uber, machine learning powers many business decisions to bring about experiences that delight our customers. To empower our product development teams with ML technology, we have built the Michelangelo Machine Learning Platform as an end-to-end solution for standardizing ML application development from exploration to production. In Michelangelo's early years, we centered our technology stack on Apache Spark, a large-scale data processing and classical ML training platform. Recent mass adoption of deep learning by our product teams has driven us to rethink our Spark-based technical architecture. Compared to classical ML techniques such as linear/logistic regression and tree-based methods, deep learning poses significant new infrastructure challenges in compute, network, and storage. In this talk, we will present case studies that highlight some of the specific challenges in large-scale deep learning training, and how we address them by leveraging Ray, a distributed compute platform.
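The abstract stays at the problem-statement level, but as a rough flavor of the distributed-compute pattern Ray provides, below is a minimal, hypothetical sketch of data-parallel gradient averaging with Ray tasks. The toy model, quadratic loss, and compute_gradient helper are illustrative stand-ins, not Michelangelo's actual training stack as presented in the talk.

# Minimal illustrative sketch (Python + Ray core): data-parallel gradient
# averaging with Ray tasks. Not Uber's production setup; model and loss are toys.
import numpy as np
import ray

ray.init()  # start a local Ray instance or connect to an existing cluster

@ray.remote
def compute_gradient(weights: np.ndarray, data_shard: np.ndarray) -> np.ndarray:
    # Stand-in for a real backward pass on one worker's data shard:
    # gradient of the quadratic loss 0.5 * ||weights - mean(shard)||^2.
    return weights - data_shard.mean(axis=0)

def train_step(weights, shards, lr=0.1):
    # Launch one gradient task per shard; Ray schedules them across the cluster,
    # then the averaged gradient is applied on the driver.
    grads = ray.get([compute_gradient.remote(weights, s) for s in shards])
    return weights - lr * np.mean(grads, axis=0)

if __name__ == "__main__":
    weights = np.zeros(4)
    shards = [np.random.randn(256, 4) for _ in range(8)]  # toy data shards
    for _ in range(20):
        weights = train_step(weights, shards)
    print(weights)  # converges toward the overall data mean

In practice, large-scale training would layer a framework (e.g., a distributed training library running on Ray) over this kind of task/actor primitive rather than hand-rolling the gradient exchange, which is part of what the talk's case studies cover.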
Xu Ning, PhD, is a senior engineering manager at Uber leading multiple software development teams within the Michelangelo Machine Learning Platform. His teams are major contributors to Uber's open-source deep learning software: the Horovod distributed training framework (horovod.ai), the Petastorm data access library (github.com/uber/petastorm), and the Neuropod unified inference library (neuropod.ai). Before Michelangelo, he led the teams behind Uber's Cherami distributed task queue, Hadoop observability, and data security. Prior to Uber, he held engineering positions at Facebook, Akamai, and Microsoft. His interests are machine learning, deep learning, MLOps, big data, and infrastructure engineering. His latest work at Uber is published at https://eng.uber.com/author/xu-ning/.
Michael Mui is a staff software engineer on Uber's Machine Learning Platform team, working on distributed training infrastructure, hyperparameter optimization, model representation, and evaluation.
Di Yu is a machine learning systems engineer focused on large-scale distributed training systems that enable modern deep learning models to run with high throughput and reliability. Before joining Uber, Di worked as an ML systems engineer at Facebook for three years, and before that he worked at ServiceNow on distributed systems.
Come connect with the global community of thinkers and disruptors who are building and deploying the next generation of AI and ML applications.
Save your spot