June 22-24 | Virtual & Free

Ray Summit 2021

Scalable machine learning,
Scalable Python, for everyone


Petabyte Scale Datalake Table Management with Ray, Arrow, Parquet, and S3

June 22, 12:25 PM - 12:55 PM

Managing a data lake that your business depends on to continuously deliver critical insights can be a daunting task. From applying table upserts/deletes during log compaction to managing structural changes through schema evolution or repartitioning, there's a lot that can go wrong and countless trade-offs to weigh. Moreover, as the volume of data in individual tables grows to petabytes and beyond, the jobs that fulfill these tasks grow increasingly expensive, fail to complete on time, and mire teams in operational burden. Scalability limits are reached and yesterday's corner cases become everyday realities. In this talk, we will discuss Amazon's progress toward resolving these issues in its S3-based data lake by leveraging Ray, Arrow, and Parquet. We will also review past approaches, subsequent lessons learned, goals met/missed, and anticipated future work.


Patrick Ames

Senior Software Engineer, Amazon