Managing a data lake that your business depends on to continuously deliver critical insights can be a daunting task. From applying table upserts and deletes during log compaction to managing structural changes through schema evolution or repartitioning, there's a lot that can go wrong and countless trade-offs to weigh. Moreover, as the volume of data in individual tables grows to petabytes and beyond, the jobs that fulfill these tasks become increasingly expensive, fail to complete on time, and bury teams in operational burden. Scalability limits are reached, and yesterday's corner cases become everyday realities. In this talk, we will discuss Amazon's progress toward resolving these issues in its S3-based data lake by leveraging Ray, Arrow, and Parquet. We will also review past approaches, lessons learned along the way, goals met and missed, and anticipated future work.
Patrick is a Senior Software Engineer working on Big Data Technologies / Data Management and Optimization at Amazon. He has spent the last 13 years wrestling with big data problems across data lakes, data warehouses, databases, financial transaction processing systems, and infosec.