
Evaluating large language models with Ray in hybrid cloud

Ray Summit 2022

Evaluating large-scale neural language models is crucial but challenging. Models must be fine-tuned and evaluated on a large number of downstream tasks, which can differ widely in problem domain, type of input data, adoption pipeline, and running environment. As a result, these tasks are usually handled in separate pipelines by different research teams. For example, one team might run a GLUE benchmarking pipeline with 9 sub-tasks while another runs a sentiment analysis pipeline with 17 sub-tasks. Such multi-task evaluation is both time-consuming (it can take a few days or more) and hard to manage across teams and pipelines.

A toolkit that supports a unified pipeline for multi-task evaluation, with easy scaling and independent resource management, is therefore highly desirable in this domain. By adopting Ray in our pipeline, we achieved easier auto-scaling, better resource management, and unified workflows across tasks. In this talk, we walk through the problem and demonstrate how we run our large-scale language model evaluation pipeline in a hybrid cloud with auto-scaling. We also cover how Ray helps unify the workflow with small code modifications to achieve auto-scaling, dependency management, and better overall performance.

This work is being developed as part of IBM's project CodeFlare.

About Linsong

Linsong Chu is a research engineer at IBM Research.


