Managing a data lake that your business depends on to continuously deliver critical insights can be a daunting task. From applying table upserts/deletes during log compaction to managing structural changes through schema evolution or repartitioning, there's a lot that can go wrong and countless trade-offs to weigh. Moreover, as the volume of data in individual tables grows to petabytes and beyond, the jobs that fulfill these tasks grow increasingly expensive, fail to complete on time, and saddle teams with operational burden. Scalability limits are reached and yesterday's corner cases become everyday realities. In this talk, we will discuss Amazon's progress toward resolving these issues in its S3-based data lake by leveraging Ray, Arrow, and Parquet. We will also review past approaches, subsequent lessons learned, goals met/missed, and anticipated future work.
Senior Software Engineer, Amazon