The videos and slides are now available for the second Ray Summit Connect, June 17, 2020.
For your convenience, there are separate videos for each talk and for the Q&A and panel discussion at the end.
Trading off Model Size and Accuracy for BERT with Ray and SigOpt,
Q&A and panel discussion with Richard Liaw, Liam Li, and Meghana Ravikumar, moderated by Dean Wampler: video.
The vision of AutoML is to remove as much manual effort and required expertise as possible when applying machine learning and artificial intelligence to real-world problems. This Ray Summit Connect explores three topics in AutoML where research is ongoing but many practical tools already exist. The first topic is hyperparameter tuning: techniques for determining the optimal model structure to use for your problem before actual model training begins. The second topic, neural architecture search, is an important subset of hyperparameter tuning that seeks the optimal architecture for your neural network. The third topic explores the pragmatic decision of finding the best balance between model size, since larger models require more computation, and accuracy, which tends to improve with larger models. You’ll hear from experts on the challenges of these topics and the current tools and techniques used for them.
10:00 AM: Distributed Hyperparameter Tuning, Richard Liaw (Anyscale)
10:15 AM: Geometry-Aware Gradient Algorithms for Neural Architecture Search, Liam Li (Determined AI)
10:30 AM: Trading off Model Size and Accuracy for BERT with Ray and SigOpt, Meghana Ravikumar (SigOpt)
10:45 AM: Panel discussion moderated by Dean Wampler with audience Q&A
Distributed Hyperparameter Tuning, Richard Liaw
The performance of modern deep learning models depends heavily on the choice of model hyperparameters, and the tuning process is a major bottleneck in the machine learning pipeline. In this talk, we will first motivate the need for advancements in hyperparameter tuning methods. The talk will then give an overview of standard methods for hyperparameter tuning: grid search, random search, and Bayesian optimization. Then, we will motivate and discuss cutting-edge methods for hyperparameter tuning: multi-fidelity Bayesian optimization, successive halving algorithms (HyperBand), and population-based training. The talk will then present an overview of Tune (http://tune.io/), a scalable hyperparameter tuning system from the UC Berkeley RISELab, and demonstrate how users can leverage the cutting-edge hyperparameter tuning methods implemented in Tune to quickly improve the performance of standard deep learning models.
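To make two of the methods named in this abstract concrete, here is a minimal sketch of random search combined with successive halving, written in plain Python. It is an illustration under stated assumptions, not Tune's implementation or API: `toy_objective`, `sample_configs`, and `successive_halving` are hypothetical names, and the objective is a stand-in for real model training.

```python
import math
import random


def toy_objective(config, budget):
    """Hypothetical stand-in for 'train with this config for `budget` epochs'.

    Returns a validation loss (lower is better). The loss is lowest when the
    learning rate is near 0.01, and it shrinks as the budget grows.
    """
    distance = abs(math.log10(config["lr"]) - math.log10(0.01))
    return distance + 1.0 / budget


def sample_configs(n, seed=0):
    """Random search: draw learning rates log-uniformly from [1e-5, 1e-1]."""
    rng = random.Random(seed)
    return [{"lr": 10 ** rng.uniform(-5, -1)} for _ in range(n)]


def successive_halving(configs, objective, min_budget=1, eta=2):
    """Evaluate all surviving configs, keep the best 1/eta of them, and
    multiply the per-config budget by eta, until one config remains."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: objective(c, budget))
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]
        budget *= eta
    return survivors[0]


candidates = sample_configs(16)
best = successive_halving(candidates, toy_objective)
print(best)
```

With 16 candidates and `eta=2`, the pool shrinks 16, 8, 4, 2, 1 while the per-config budget grows 1, 2, 4, 8, so most of the compute goes to the most promising configurations. HyperBand builds on this idea by running several such brackets with different starting budgets.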
Geometry-Aware Gradient Algorithms for Neural Architecture Search, Liam Li
Deep learning offers the promise of bypassing the process of manual feature engineering by learning representations in conjunction with statistical models in an end-to-end fashion. However, neural network architectures themselves are typically designed by experts in a painstaking, ad hoc fashion. Neural architecture search (NAS) presents a promising path for alleviating this pain by automatically identifying architectures that are superior to hand-designed ones. In this talk we will present our recent GAEA framework, which provides principled and computationally efficient algorithms for NAS that yield state-of-the-art (SOTA) performance on a wide range of leading NAS benchmarks in computer vision. We will also briefly discuss practical infrastructural hurdles associated with large-scale NAS workflows, and how we tackle these hurdles with Determined AI’s open-source training platform.
Trading off Model Size and Accuracy for BERT with Ray and SigOpt, Meghana Ravikumar
With the publication of BERT, transfer learning was suddenly accessible for NLP, unlocking a plethora of model zoos and boosting performance on domain-specific problems. Although BERT has accelerated many modeling efforts, its size is limiting for federated learning, edge computing, and some production systems. In this talk, we will explore how to reduce the size of BERT while retaining its capacity in the context of question answering tasks.
Our approach encompasses fine-tuning, distillation, and hyperparameter optimization at scale. First, we fine-tune BERT on SQuAD 2.0 to produce our teacher model, and use distillation to compress the fine-tuned BERT into a smaller model (our student model). Then, combining SigOpt and Ray, we use multimetric hyperparameter optimization at scale to find the optimal architecture for the student model. Finally, we explore the trade-offs of our hyperparameter decisions to draw insights for our student model’s architecture.
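The size-versus-accuracy trade-off described here can be pictured as a Pareto frontier: the set of student models for which no other candidate is both smaller and more accurate. The sketch below is a plain-Python illustration of that idea only. It does not use the SigOpt or Ray APIs, `pareto_front` is a hypothetical helper, and the trial numbers are made up for the example rather than taken from the talk.

```python
def pareto_front(trials):
    """Return the trials that are not dominated by any other trial.

    Each trial has 'params' (model size, lower is better) and 'f1'
    (accuracy metric, higher is better). A trial is dominated if another
    trial is at least as good on both metrics and strictly better on one.
    """
    front = []
    for t in trials:
        dominated = any(
            o["params"] <= t["params"]
            and o["f1"] >= t["f1"]
            and (o["params"] < t["params"] or o["f1"] > t["f1"])
            for o in trials
        )
        if not dominated:
            front.append(t)
    # Sort by size so the frontier reads from smallest to largest model.
    return sorted(front, key=lambda t: t["params"])


# Made-up numbers purely for illustration; not results from the talk.
trials = [
    {"name": "a", "params": 66, "f1": 0.70},
    {"name": "b", "params": 40, "f1": 0.66},
    {"name": "c", "params": 50, "f1": 0.64},  # dominated by b
    {"name": "d", "params": 30, "f1": 0.60},
    {"name": "e", "params": 66, "f1": 0.68},  # dominated by a
]

for t in pareto_front(trials):
    print(t["name"], t["params"], t["f1"])
```

Every point on the frontier is a defensible choice; which one to deploy depends on how much accuracy a given deployment target, such as an edge device, can afford to give up for a smaller model.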