
Webinar

Building a Multimodal Video Processing Pipeline with Ray

Curating high-quality video data is one of the hardest problems in modern AI: it's CPU-heavy in some stages, GPU-heavy in others, and traditional staged pipelines leave expensive accelerators idle most of the time. Ray Data solves this with streaming execution and heterogeneous scheduling, letting you fuse CPU preprocessing, vision-language model annotation, and embedding generation into a single pipeline where every resource stays busy.

Join us for a live, instructor-led hands-on lab where ML engineers, data engineers, and platform engineers will build a production-style multimodal video curation pipeline end-to-end on the Anyscale Platform. You'll start from raw videos streamed directly from Hugging Face's FineVideo dataset and finish with a curated, semantically annotated, embedding-ready Parquet dataset: the exact kind of asset used to train modern VLMs and video foundation models.

In this session, you'll learn how to:
- Curate video data for training modern multimodal models
- Build and scale data pipelines with Ray
- Stream large datasets from remote sources at scale
- Run distributed GPU inference with Ray Data
- Scale embedding generation with CPU actor pools
- Compose CPU and GPU stages into one streaming pipeline
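The composition described above can be sketched with Ray Data's `map_batches` API, which lets each stage declare its own resource needs so CPU tasks and GPU actors run concurrently in one streaming pipeline. This is a minimal illustration, not the lab's actual code: `decode_frames`, `VLMAnnotator`, and `embed_batch` are hypothetical stand-ins for the real stages.

```python
# A minimal sketch of fusing CPU and GPU stages in one Ray Data pipeline.
# The stage functions below are hypothetical placeholders.
import numpy as np


def decode_frames(batch: dict) -> dict:
    # CPU stage: stand-in for decoding video bytes into frame arrays.
    batch["frames"] = [np.zeros((4, 8, 8, 3), dtype=np.uint8) for _ in batch["video"]]
    return batch


class VLMAnnotator:
    # GPU stage: a stateful callable that would load a vision-language
    # model once per actor and caption each batch of frames.
    def __call__(self, batch: dict) -> dict:
        batch["caption"] = ["placeholder caption"] * len(batch["frames"])
        return batch


def embed_batch(batch: dict) -> dict:
    # CPU stage: stand-in for embedding generation (here, mean-pooled pixels).
    batch["embedding"] = [f.mean(axis=(0, 1, 2)) for f in batch["frames"]]
    return batch


def build_pipeline():
    # Requires Ray with the Data library: pip install "ray[data]"
    import ray

    ds = ray.data.from_items([{"video": b"..."} for _ in range(100)])
    ds = ds.map_batches(decode_frames)                             # CPU tasks
    ds = ds.map_batches(VLMAnnotator, concurrency=2, num_gpus=1)   # GPU actor pool
    ds = ds.map_batches(embed_batch)                               # CPU tasks
    ds.write_parquet("local:///tmp/curated_videos")                # streamed output
```

Because Ray Data executes these stages as a stream rather than in sequence, the GPU actors stay busy annotating earlier batches while CPU workers decode and embed later ones.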