Wednesday, August 24
11:30 AM - 12:00 PM
The TorchX team has been collaborating with the Ray team on easy ways to launch elastic, large-scale distributed training jobs using TorchX entirely from a notebook, and this is now available as an experimental project. A few problems have traditionally made this difficult: submitting jobs against that infrastructure easily, monitoring job status and aggregating logs from several machines, and integrating those jobs with the rest of your infrastructure in a modular way. TorchX components make it much easier to reuse code and try out different training infrastructures as you take your work from research to production, without writing a single new line of code.
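The workflow described above might look roughly like the following from a shell (or a notebook cell). This is a hedged sketch: it assumes `torchx` is installed, a Ray cluster is reachable, and a local `train.py` training script exists; the exact flags and component arguments should be checked against the TorchX documentation for your version.

```shell
# Submit an elastic distributed (DDP) job to a Ray cluster using the
# TorchX ray scheduler. "dist.ddp" is a built-in TorchX component;
# -j 2x2 requests 2 nodes with 2 processes each (assumed syntax).
torchx run --scheduler ray dist.ddp --script train.py -j 2x2

# The run command prints an app handle; use it to check status and to
# tail logs aggregated from all machines (handle below is a placeholder).
torchx status "<app_handle>"
torchx log "<app_handle>"
```

Because the scheduler is just a flag, the same component can be resubmitted to a different backend later without changing the training code, which is the modularity the abstract refers to.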
Mark Saroufim is a senior machine learning engineer at Meta focused on productionizing PyTorch in open source. He maintains pytorch/serve and has contributed to numerous other PyTorch projects. Mark also enjoys producing ML-related content on Substack, Medium, YouTube, and Twitch.
Come connect with the global community of thinkers and disruptors who are building and deploying the next generation of AI and ML applications.