
Breaking the RAG Bottleneck: Scalable Document Processing with Ray Data and Docling

By Ana Biazetti (Red Hat), Richard Liaw, and Cathal O’Connor (Red Hat)   |   February 27, 2026

Executive Summary

Enterprise teams often struggle with the "data bottleneck" when building generative AI (GenAI) applications like RAG (Retrieval-Augmented Generation), as traditional document processing tools fail to handle thousands of complex documents efficiently. This blog explores how a unified infrastructure, combining Ray Data for high-speed streaming and Docling for precise document parsing, removes these hurdles. By scaling these tools on platforms such as Red Hat OpenShift AI or Anyscale, organizations can transform messy, unstructured data into actionable insights in hours rather than days, laying the foundation of trust and reliability for the next wave of AI innovation.


The Reality of the RAG Data Bottleneck

Demos make building generative AI look easy, but the reality of data preparation and processing is much harder. Imagine your team just inherited tens of thousands of legacy PDFs and the CEO wants them searchable ASAP: processing 10,000+ complex documents with tables and images can quickly become a bottleneck that takes weeks to clear. The "unsexy" reality is that most AI projects spend the majority of their time wrestling with data preparation rather than tuning models.

The biggest hurdle in RAG development is the inefficiency of legacy data pipelines. RAG enhances LLM responses by retrieving relevant context from a knowledge base. Documents are processed (parsed, chunked, embedded) and stored in a vector database. At query time, relevant chunks are retrieved and provided as context to the LLM, enabling accurate answers grounded in your organization's data, as shown in the figure below:

RAG Data processing, retrieval and generation workflow
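As a concrete illustration of the "chunked" step in the workflow above, here is a minimal, self-contained sketch of fixed-size chunking with overlap. The function name and the sizes are illustrative, not recommendations; production pipelines often chunk along document structure (sections, tables) instead of raw character counts.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` characters,
    with `overlap` characters shared between consecutive chunks so
    context is not cut off at chunk boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk is then embedded and stored alongside its source metadata so retrieved context can be traced back to the original document.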

Traditional data processing frameworks often fail to meet the needs of AI because they cannot effectively coordinate the pipeline's mixed compute requirements, where document parsing is CPU-bound and embedding is GPU-bound. To scale AI, enterprise teams must move toward a unified infrastructure that handles both CPU-heavy parsing and GPU-heavy embedding in a single, streamlined process.


Scaling with Ray Data and Docling

Ray Data is a distributed processing library built specifically for AI and machine learning workloads. Its streaming execution engine pipelines data across CPU and GPU tasks, maximizing GPU utilization while keeping memory usage constant. Because it is Python-native, it eliminates the serialization overhead of translating data between different language environments, enabling faster iteration cycles for RAG pipelines.

Docling handles the complex parsing that traditional tools often get wrong, ensuring your AI has the right context to provide useful answers. By accurately parsing tables and layouts in PDFs, Docling preserves the semantic structure that makes RAG retrieval actually useful. When integrated with Ray Data, each worker node keeps a Docling instance, with its expert AI models for layout and table analysis, loaded in memory, enabling high-performance distributed document processing.

Ray Data streamlines large-scale processing by partitioning datasets into blocks, which are streamed through a cluster to enable massive parallelism. In this architecture, a Ray Data Driver manages the execution plan and serializes task code (like Docling processing) for distribution, while the Ray Workers handle the actual compute. These workers read data blocks directly from storage and execute parallel writes of the resulting JSON files to the destination, ensuring the Driver never becomes a bottleneck, as shown in the architecture below.

Ray distributed document processing with Docling

AI Data Processing Architecture

  • Ray Driver: Manages the ObjectRefs and execution plan, serializing the Docling code for the workers.

  • Streaming Blocks: The Ray Workers pull data directly from input storage in parallel.

  • Parallel Writing: Each worker writes its processed JSON output directly to storage, ensuring the Ray Driver isn't overwhelmed by data throughput.

The integration handles all the distributed complexity, including scheduling, failure recovery, and memory management, automatically. The use of heterogeneous compute allows CPUs to parse while GPUs simultaneously embed data, ensuring expensive GPU resources stay fully utilized throughout the process.


Running Ray on Kubernetes with KubeRay

The enterprise foundation on Kubernetes is provided by KubeRay, which orchestrates Ray clusters with built-in reliability and security. KubeRay handles operational complexities such as dynamic cluster autoscaling, fault tolerance, and automatic recovery when worker nodes fail. This allows you to scale from 10 to 100 nodes transparently to meet the demands of large ingestion jobs.

The end-to-end flow is straightforward, as shown in the figure below:

Ray Data pipeline for document processing on Kubernetes
  1. Documents (for instance, PDFs) land in object storage (such as S3 or a PVC).

  2. Ray Data pipeline reads these documents and distributes them across worker nodes.

  3. Docling parses the documents on worker nodes, followed by chunking for the embedding model.

  4. Embeddings are generated on GPU nodes and written to a vector database such as Milvus.

  5. A RAG application queries the database, feeding context to an LLM to generate accurate responses.

Running on KubeRay keeps your data processing within your Kubernetes cluster, meeting data residency requirements and enabling deployment in virtual private clouds or on-premises environments. This unified infrastructure reduces operational overhead by letting you run data preparation and model serving on the same platform.


Future Outlook: Moving Toward Agentic Solutions

The future of enterprise AI depends on moving past simple search toward sophisticated agentic solutions. Organizations must focus on industrializing their data pipelines to support multi-step agentic workflows, where autonomous agents use RAG and Retrieval-Augmented Fine-Tuning (RAFT) to solve complex problems. By combining RAG's real-time context with RAFT's ability to train a model to ignore irrelevant retrieved content, teams can build agents that are significantly more accurate and reliable.

Those who invest in scalable architectures today will be best positioned to implement these advanced inference chains, where multiple LLM calls happen in sequence with optimal resource allocation. The shift toward agentic AI means that the quality of your processed data is more critical than ever, as agents rely on precise documentation to execute tasks on behalf of users. A robust foundation allows for these creative AI implementations to meet enterprise standards for consistency and safety.

Ultimately, the goal is to make complex information easy for these AI agents to grasp and act upon, building a knowledgeable and satisfied audience along the way. That starts with making information accessible through an open-source foundation with enterprise governance. By establishing a robust, unified platform now, businesses can ensure their agentic initiatives deliver long-term value and foster trust with their users.


Conclusion

Success in generative AI starts with a commitment to high-quality data and scalable infrastructure. Red Hat OpenShift AI and Anyscale platforms provide the security and reliability needed to turn complex documents into actionable intelligence. By eliminating the data processing bottleneck with Ray Data and Docling, we allow organizations to focus on what matters most: solving real-world problems.

Ready to try Anyscale?

Access Anyscale today to see how companies using Anyscale and Ray benefit from rapid time-to-market and faster iterations across the entire AI lifecycle.