ML Infra + Apps

Large-scale distributed training with TorchX and Ray

Ray Summit 2022

The TorchX team has been collaborating with the Ray team on an easy way to launch elastic, large-scale distributed training jobs with TorchX entirely from a notebook, and this is now available as an experimental project. A few problems have traditionally stood in the way: submitting jobs against that infrastructure easily, monitoring job status and aggregating logs from several machines, and integrating the job with the rest of your infrastructure in a modular way. TorchX components make it much easier to reuse code and experiment with different training infrastructures as you take your work from research to production, without writing a single new line of code.
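As a rough sketch of what this workflow looks like from the command line (exact flags depend on your TorchX version; the `dist.ddp` component and the `ray` scheduler are part of TorchX, while `train.py`, the job sizing, and the `<app_handle>` placeholder are illustrative assumptions, not values from this talk):

```shell
# Submit a distributed PyTorch DDP job to a Ray cluster via TorchX.
# Assumes torchx is installed with Ray support and a Ray cluster is reachable
# (scheduler configuration such as the cluster address is omitted here).
torchx run --scheduler ray dist.ddp --script train.py -j 2x4  # 2 nodes, 4 workers each

# Submission prints an app handle; use it to poll job status and stream
# logs aggregated from all workers back to your notebook or terminal.
torchx status <app_handle>
torchx log <app_handle>
```

The same `dist.ddp` component can be pointed at a different scheduler (for example a local or Kubernetes backend) by changing only the `--scheduler` argument, which is what makes swapping training infrastructures possible without code changes.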

About Mark

Mark Saroufim is a senior machine learning engineer at Meta focused on productionizing PyTorch in open source. He maintains pytorch/serve and has contributed to numerous other PyTorch projects. Mark also enjoys producing ML-related content on Substack, Medium, YouTube, and Twitch.

Mark Saroufim

Machine Learning Engineer, Meta

Ready to Register?

Come connect with the global community of thinkers and disruptors who are building and deploying the next generation of AI and ML applications.

Save your spot

Join the Conversation

Ready to get involved in the Ray community before the conference? Ask a question in the forums. Open a pull request. Or share why you’re excited with the hashtag #RaySummit on Twitter.