June 22-24 | Virtual & Free

Ray Summit 2021

Scalable machine learning,
Scalable Python, for everyone

Petabyte Scale Datalake Table Management with Ray, Arrow, Parquet, and S3

June 22, 12:25 PM - 12:55 PM

Managing a data lake that your business depends on to continuously deliver critical insights can be a daunting task. From applying table upserts and deletes during log compaction to managing structural changes through schema evolution or repartitioning, there's a lot that can go wrong and countless trade-offs to weigh. Moreover, as the volume of data in individual tables grows to petabytes and beyond, the jobs that fulfill these tasks grow increasingly expensive, fail to complete on time, and saddle teams with operational burden. Scalability limits are reached, and yesterday's corner cases become everyday realities. In this talk, we will discuss Amazon's progress toward resolving these issues in its S3-based data lake by leveraging Ray, Arrow, and Parquet. We will also review past approaches, subsequent lessons learned, goals met and missed, and anticipated future work.

Speakers

Patrick Ames

Senior Software Engineer, Amazon