Data Processing is Becoming a GPU Workload

By Robert Nishihara   |   June 16, 2026

For decades, the mental model for a data pipeline was straightforward. You took raw data, parsed it, filtered it, joined it, and wrote the result somewhere useful. The dominant abstraction was tables, the dominant language was SQL, and the compute substrate was a cluster of homogeneous CPU machines.

That world is not going away. 

But a structural shift is underway across the data landscape. A growing fraction of high-value data processing no longer looks like traditional ETL. It is moving to GPUs.

The logic driving this is simple. Vast quantities of actionable, high-value information are stored as unstructured, multimodal data. A modern pipeline might read every executed corporate contract to identify hidden business risks, extract product insights from video recordings of external customer meetings, or analyze internal productivity bottlenecks across Slack and email threads. It might embed every document, transcribe every conversation, or run a vision-language model over petabytes of robotic sensor data.

Historically, companies could store this information, search metadata around it, or build narrow task-specific systems, but they could not deeply understand the raw data at scale. You are not going to run a standard SQL query on a raw video or understand a robot’s trajectory with a group-by statement.

The way to process this kind of data is with models. Today’s multimodal models and embedding models make it possible to search, extract, and structure data of all types. This means running inference, and inference runs on GPUs. As a result, data processing is becoming inference heavy, and therefore GPU heavy.

Existing big data systems were built for an era of computing on homogeneous CPU machines. This shift is introducing a new category of systems challenges. For organizations that solve them, the opportunity is to extract value from fundamentally new sources of data.

The three shifts behind GPU data processing

Three related shifts are happening at once.

  • Tabular to multimodal: Historically, most data that businesses could effectively work with was neatly structured in tables. Unstructured data like video, audio, PDFs, and sensor readings was too unwieldy to process at scale, so it was dumped into storage and ignored. These rich, unstructured formats, which used to be inaccessible programmatically, are now a primary source of new insights.

  • SQL to inference: SQL is the primary way that companies manipulate tabular data. It is incredibly powerful, but it is not the right tool for working with a video or a PDF or robotic sensor data or genomic sequence data. For these highly unstructured data types, the right tool is model inference. Inference is becoming the central subroutine in complex data pipelines that fuse model execution with regular processing.

  • CPUs to GPUs: Multimodal data processing is inference heavy, and inference runs on GPUs. While the scale of the CPU processing is often much greater, the GPU stages are frequently much more costly and therefore more important to optimize.

Three shifts in data processing

Data processing is undergoing three major shifts.

Inference creates structure from unstructured data. In many pipelines, GPU data processing does not replace traditional data processing. Instead, it makes traditional data processing possible on new sources of data. Inference is used to extract labels, embeddings, classifications, summaries, entities, or structured records from unstructured data. Once that structure exists, SQL engines, Spark jobs, vector databases, and other traditional tools become useful downstream.
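To make this concrete, here is a minimal sketch of the pattern using only the Python standard library. The `fake_extract` function is a hypothetical keyword matcher standing in for a real model call, and the sample documents and table schema are made up for illustration. The point is the shape of the flow: unstructured text goes in, structured rows come out, and ordinary SQL takes over from there.

```python
import sqlite3

def fake_extract(doc_id: int, text: str) -> dict:
    """Placeholder for model inference: returns one structured record per document."""
    lowered = text.lower()
    return {
        "doc_id": doc_id,
        "doc_type": "contract" if "agreement" in lowered else "other",
        "mentions_penalty": int("penalty" in lowered),
    }

# Illustrative inputs (not real data).
documents = [
    "Master Services Agreement. A late delivery penalty applies.",
    "Meeting notes: customer asked about export features.",
    "Supply Agreement between two parties, net 30 terms.",
]

# Inference stage: unstructured text in, structured rows out.
records = [fake_extract(i, text) for i, text in enumerate(documents)]

# Downstream stage: plain SQL over the extracted structure.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (doc_id INTEGER, doc_type TEXT, mentions_penalty INTEGER)"
)
conn.executemany(
    "INSERT INTO docs VALUES (:doc_id, :doc_type, :mentions_penalty)", records
)
risky = conn.execute(
    "SELECT doc_id FROM docs WHERE doc_type = 'contract' AND mentions_penalty = 1"
).fetchall()
print(risky)
```

In a real pipeline the extraction step would be the GPU-heavy part, and the query at the end would run in whatever SQL engine the team already uses.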

Why Now?

Two underlying trends are accelerating the shift toward GPU data processing.

Data Curation is Now Model-Driven

As model quality increases, the marginal utility of low-quality data decreases, and the importance of improved curation techniques grows. While traditional curation relied on simple heuristic filters such as word frequencies, excessive repetition, or length constraints, modern curation is increasingly model-based. Judging the quality of data, or rewriting it to improve that quality, is an inference task.
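The difference between the two styles of filter can be sketched in a few lines. In the example below, `heuristic_ok` is a classic rule-based check (a length constraint plus a repetition check), and `curate` layers a pluggable quality score on top. The `toy_quality_score` function is a hypothetical stand-in for a model; in practice that slot would be filled by an inference call to a classifier or an LLM judge. The thresholds and sample texts are illustrative assumptions.

```python
from collections import Counter
from typing import Callable

def heuristic_ok(text: str, min_words: int = 5, max_repeat_frac: float = 0.3) -> bool:
    """Rule-based filter: minimum length plus a cap on the most repeated word."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words) <= max_repeat_frac

def curate(texts: list[str], score: Callable[[str], float], threshold: float) -> list[str]:
    """Apply the cheap heuristics first, then keep texts whose score clears the threshold."""
    return [t for t in texts if heuristic_ok(t) and score(t) >= threshold]

def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a quality model: fraction of distinct words."""
    words = text.lower().split()
    return len(set(words)) / len(words)

samples = [
    "buy now buy now buy now buy now",               # fails the repetition check
    "too short",                                     # fails the length check
    "The decode phase of inference is usually limited by memory bandwidth.",
    "the cat sat and the dog sat and a bird sat here",  # passes heuristics, low score
]
kept = curate(samples, toy_quality_score, threshold=0.8)
print(kept)
```

The last sample is the interesting one: it slips past the heuristics and is only rejected by the scoring step, which is the part that becomes an inference workload at scale.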

Data quality is a moving target. The quality level of data that was sufficient to train one model may not be sufficient to train a next-generation model. As model quality increases, data quality must increase in tandem. As this quality bar continues to rise, the amount of inference used broadly across training data preparation and curation will only grow with it.

Dataset curation full picture

Netflix’s dataset curation workflow. More details in this talk.

Scaling with Compute, Not Just Data Volume

We are used to thinking about scaling a training run by increasing the volume of training data. Data scale still matters, but it is only one ingredient. Compute is the other ingredient, and a growing number of techniques are essentially methods for turning compute into high-quality data.

This happens in a few ways:

  • Model-driven curation: The quality refinement process described above.

  • Synthetic data and simulation: Using models or simulators to generate new data for training.

  • Reinforcement learning: A data-efficient technique for using a model in conjunction with an environment, simulator, tool, or evaluator to generate training data.

  • Reasoning and continual improvement loops: Performing large quantities of reasoning, evaluation, or execution, and transforming those traces into lessons that can be fed back into the model’s context.

These methods turn inference into data generation. This trend will drive an even greater need for GPUs across the entire data processing lifecycle.

Why this is not just Spark on GPUs

Forcing modern, inference-heavy AI workloads into existing big data architectures exposes severe limitations. These are the types of systems challenges we designed Ray and Anyscale to solve.

Homogeneous clusters cause underutilization

AI data processing is highly heterogeneous. A standard pipeline might have preprocessing operations that are CPU-bound or memory-bound, while the inference step is GPU-bound. Even within inference, the prefill phase may be bound by GPU compute while the decode phase is bound by GPU memory bandwidth, which is one reason the two phases are sometimes disaggregated. Running these operations in a single pipeline requires separating the stages of compute, selecting the right compute shape for each stage, and right-sizing each compute pool. Coercing these diverse workloads into a single, homogeneous instance shape can lead to severe underutilization of your most expensive hardware.
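A back-of-the-envelope version of the right-sizing problem looks like this. The per-worker throughput numbers and stage names below are made-up assumptions, not measurements; the sketch only shows the arithmetic of sizing each pool around the most expensive resource.

```python
import math

def workers_needed(target_rows_per_s: float, rows_per_s_per_worker: float) -> int:
    """Smallest worker count whose combined throughput meets the target."""
    return math.ceil(target_rows_per_s / rows_per_s_per_worker)

# Assumed, illustrative per-worker throughputs in rows per second.
stage_throughput = {
    "decode_on_cpu": 50.0,
    "inference_on_gpu": 400.0,
    "write_on_cpu": 200.0,
}

num_gpus = 8
# Size every pool so the GPUs, the costliest resource, stay saturated.
target = num_gpus * stage_throughput["inference_on_gpu"]

pool_sizes = {
    stage: workers_needed(target, rate) for stage, rate in stage_throughput.items()
}
print(pool_sizes)
```

With these numbers, keeping 8 GPUs busy takes 64 decode workers and 16 write workers. An instance shape that bundles only a handful of CPU cores with each GPU would leave the GPUs waiting on preprocessing.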

Pipelines need streaming execution, not stage barriers

Traditional bulk synchronous parallel systems, like Apache Spark, proceed one stage at a time and often materialize intermediate data. GPU data processing is most efficient when different stages of compute overlap. Proceeding one stage at a time with strict barriers means that your GPUs may sit idle while the CPU stages run. GPU nodes and GPU memory are also limited resources, so materializing data between stages may not even be an option.
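The idea can be illustrated with a toy two-stage pipeline built from threads and a bounded queue. This is a single-process sketch under simple assumptions, not how any particular distributed engine implements streaming: the "CPU stage" and "GPU stage" are placeholder arithmetic, and the buffer size of 4 is arbitrary. What it shows is that the downstream stage starts as soon as the first row arrives, and that the bounded buffer keeps the intermediate data from ever being fully materialized.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream

def cpu_stage(rows, out_q):
    """Producer: preprocess rows and hand them downstream one at a time."""
    for row in rows:
        out_q.put(row * 2)  # blocks while the buffer is full
    out_q.put(SENTINEL)

def gpu_stage(in_q, results):
    """Consumer: begins work as soon as the first row is available."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item + 1)  # placeholder for model inference

# Bounded buffer: at most 4 rows sit between the two stages at any time.
buffer = queue.Queue(maxsize=4)
results = []

producer = threading.Thread(target=cpu_stage, args=(range(100), buffer))
consumer = threading.Thread(target=gpu_stage, args=(buffer, results))
producer.start()
consumer.start()
producer.join()
consumer.join()

print(results[:5])
```

A stage-barrier version would compute all 100 preprocessed rows, store them, and only then start the second stage. Here the two stages overlap, and the producer slows down automatically when the consumer falls behind.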

Extreme stragglers are the norm

In traditional data processing, each stage of computation is relatively predictable. In an AI pipeline, that predictability disappears. A single stage may consist of LLM inference with highly variable input and output lengths. Alternatively, it could involve an entire agentic loop, such as cloning a codebase, patching the code, compiling it, and running a test suite. In these scenarios, the amount of time it takes to process different rows can vary wildly.
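A small simulation shows why this matters for scheduling. The workload below is an assumption chosen for illustration: 40 rows, most taking 1 second, with two 30-second stragglers. It compares pre-assigning rows to workers up front against letting each free worker pull the next row.

```python
import heapq

def static_makespan(durations, num_workers):
    """Rows pre-assigned round-robin; the job ends when the slowest worker ends."""
    totals = [0.0] * num_workers
    for i, d in enumerate(durations):
        totals[i % num_workers] += d
    return max(totals)

def dynamic_makespan(durations, num_workers):
    """Each row goes to whichever worker frees up first (pull-based scheduling)."""
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, start + d)  # busy until start + d
    return max(free_at)

# Assumed workload: mostly 1-second rows, with two 30-second stragglers.
durations = [1.0] * 40
durations[0] = 30.0
durations[4] = 30.0

print(static_makespan(durations, num_workers=4))
print(dynamic_makespan(durations, num_workers=4))
```

In this setup both stragglers land on the same worker under static assignment, so the job takes 68 seconds while three workers sit idle for most of it. With pull-based scheduling the other workers absorb the short rows and the job finishes in 31 seconds, close to the 30-second floor set by a single straggler.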

I/O now means APIs

In AI pipelines, I/O often means calling external APIs, vector databases, or models hosted in other clusters. Handling these API-bound stages requires heavy multithreading or asynchrony. The system needs high concurrency without overwhelming downstream services. It needs backpressure to avoid global rate limits. It needs retries and timeout handling. And it needs to compose these API-bound stages with CPU and GPU stages in the same pipeline.
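Here is a minimal sketch of such an API-bound stage using `asyncio` from the standard library. The `fake_api` coroutine is a hypothetical flaky service, and the limits (8 requests in flight, 3 retries, a 1-second timeout) are arbitrary assumptions. Note that the semaphore caps concurrency rather than requests per second, so a real rate limit would need an additional mechanism such as a token bucket.

```python
import asyncio

async def call_with_retries(call, arg, *, retries=3, timeout_s=1.0, base_delay_s=0.01):
    """Run one API call with a per-attempt timeout and exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(call(arg), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            await asyncio.sleep(base_delay_s * 2 ** attempt)

async def run_stage(call, items, max_in_flight=8):
    """Process all items with at most `max_in_flight` concurrent requests."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def guarded(item):
        async with limiter:
            return await call_with_retries(call, item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))

# Hypothetical flaky service: fails the first attempt for every third item.
seen = set()
stats = {"in_flight": 0, "peak": 0}

async def fake_api(x):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    try:
        await asyncio.sleep(0.001)
        if x % 3 == 0 and x not in seen:
            seen.add(x)
            raise ConnectionError("transient failure")
        return x * 10
    finally:
        stats["in_flight"] -= 1

outputs = asyncio.run(run_stage(fake_api, range(20)))
print(outputs[:5], stats["peak"])
```

The harder problem the section describes is composing a stage like this with CPU and GPU stages so that none of them starves the others, which a single-process example cannot show.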

Case Studies in the Wild

This pattern is visible across the Ray community and Anyscale users. Industry leaders have already rebuilt their stacks for this shift.

Multimodal data curation

Preparing massive datasets for training next-generation foundation models is one of the fastest growing workloads we see today.

  • Netflix runs complex multimodal data curation (talk).

  • Alibaba built Data-Juicer, a foundation model data preparation system (docs).

  • Nvidia developed NeMo Curator, an open-source framework for preprocessing and curating text, audio, images, and video (talk).

Audio, video, and image processing

Processing media data at scale requires infrastructure designed to handle heterogeneity.

  • ByteDance runs massive video and audio pipelines (talk), as well as multimodal embedding computations (blog).

  • Apple manages petabyte-scale multimodal data processing pipelines (talk).

  • xAI runs large-scale image and video processing (talk).

  • Runway powers its generative video processing pipelines (blog).

Robotics and autonomy

Robotics and autonomy pipelines combine sensor streams, trajectories, logs, simulation outputs, labeling, filtering, and model inference.

  • Motional processes petabytes of autonomous drive logs (talk).

  • Physical Intelligence orchestrates its robotics training data preparation (blog).

  • Bedrock Robotics manages its training data preparation pipelines (blog).

Large-scale batch inference

Large-scale batch inference is a core primitive for generating embeddings, annotations, or other forms of structure.

  • Roblox orchestrates batch inference workloads at scale (blog).

  • Pinterest achieves order-of-magnitude cost reductions for its batch inference workloads (blog).

  • Notion computes large-scale document embeddings (blog).

  • Applied Intuition runs complex batch inference workloads (talk).

Nvidia’s NeMo Curator is an open-source framework for preprocessing and curating all modalities of data including text, audio, images, and video. It illustrates many of the nuances of modern AI data processing.

Conclusion

As organizations begin extracting more value from various forms of multimodal data, they will begin collecting far more of it. As the data volume grows, the architectural gravity of data engineering will continue to shift toward accelerated, heterogeneous hardware.
