Apache Spark has long been the go-to engine for large-scale data processing, powering everything from BI to ETL pipelines. But as data teams increasingly work with complex transformations and diverse data types such as text, images, video, and audio, new processing patterns have emerged that rely on Python UDFs and AI models. This is where Ray, the modern engine for distributed Python, can help: it supports all types of data and AI models, from traditional workloads to the cutting edge.
Join this session to learn how to enhance your investments in open data formats and governance frameworks by seamlessly integrating Anyscale and Ray into your data platform. We'll cover:
Ray Data Fundamentals: how to process unstructured and structured data with Ray Data.
Actor-Based Execution: how Ray's actor-based execution model parallelizes image, text, and audio workloads at scale, improving hardware utilization (see the first sketch after this list).
Integration Patterns: reading data from Unity Catalog with Ray Data, performing AI-powered data transformations, and writing the results back to Unity Catalog (see the second sketch after this list).
Ray Ecosystem Benefits: downstream use cases unlocked by making Ray’s compute backend available to developers.
Live Demo: how to run data pipelines, such as embedding generation, with Ray (see the third sketch after this list).
Production Considerations: what it takes to integrate Ray into a production data platform, and how Anyscale provides the shortest path to production.
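To give a flavor of the first two points, here is a minimal sketch of processing image data with Ray Data using an actor pool. It assumes a running Ray cluster and uniformly sized images; the S3 paths are hypothetical placeholders, not part of any real integration.

```python
# A minimal sketch, assuming a running Ray cluster and uniformly sized images.
# The S3 paths are hypothetical placeholders.
import ray

ray.init()

# Ray Data reads unstructured data (images) and structured data (Parquet)
# through the same Dataset API.
images = ray.data.read_images("s3://my-bucket/images/")
metadata = ray.data.read_parquet("s3://my-bucket/metadata/")  # structured side

# A callable class runs as a pool of actors: expensive setup in __init__
# happens once per actor replica, then each replica processes batches in parallel.
class Normalizer:
    def __init__(self):
        # Model loading or other expensive setup would go here.
        self.scale = 255.0

    def __call__(self, batch: dict) -> dict:
        # With uniformly sized images, the batch arrives as one (B, H, W, C) array.
        batch["image"] = batch["image"].astype("float32") / self.scale
        return batch

# concurrency=4 launches four actor replicas, keeping hardware busy across the cluster.
normalized = images.map_batches(Normalizer, concurrency=4, batch_size=64)
normalized.show(1)
```

The same pattern extends to text and audio workloads: swap the reader and the per-batch transform.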
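For the Unity Catalog round trip, the details depend on your catalog setup. The sketch below shows one possible pattern under the assumption that the tables are external tables backed by Parquet files whose storage locations you can resolve through your catalog tooling; the paths and the classify() stub are hypothetical.

```python
# One possible pattern, assuming the Unity Catalog tables are backed by
# Parquet files in cloud storage; paths and the classify() stub are hypothetical.
import numpy as np
import ray

ray.init()

# 1. Read: resolve the table to its underlying storage location
#    (e.g., via the Unity Catalog APIs) and load it with Ray Data.
reviews = ray.data.read_parquet("s3://lakehouse/sales/reviews/")

# 2. Transform: apply an AI-powered transformation as a plain Python UDF.
def classify(batch: dict) -> dict:
    # A real pipeline would call a model here; this stub tags every row.
    batch["sentiment"] = np.full(len(batch["text"]), "unknown")
    return batch

enriched = reviews.map_batches(classify, batch_size=256)

# 3. Write back: persist the output, then register the path as an external
#    table in Unity Catalog through your governance tooling.
enriched.write_parquet("s3://lakehouse/sales/reviews_enriched/")
```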
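And a sketch of the kind of embedding pipeline the demo walks through, assuming the sentence-transformers package is installed; the corpus and output paths are illustrative.

```python
# A minimal embedding pipeline sketch, assuming sentence-transformers is
# installed; the corpus and output paths are illustrative.
import ray

ray.init()

class Embedder:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        # Loaded once per actor replica and reused across all of its batches.
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, batch: dict) -> dict:
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

docs = ray.data.read_text("s3://my-bucket/docs/")  # one row per line of text
# Two actor replicas embed batches in parallel; pass num_gpus=1 to pin each to a GPU.
embedded = docs.map_batches(Embedder, concurrency=2, batch_size=32)
embedded.write_parquet("s3://my-bucket/embeddings/")
```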
Who should attend: existing Spark users and platform architects across data engineering, data science, research, and ML engineering.