Most discussions about retrieval-augmented generation (RAG) focus on choosing the right model, tuning prompts, or experimenting with vector databases. In practice, these are rarely the hardest parts. The real bottleneck appears much earlier: getting clean, reliable text out of messy documents.
The real challenges lie in ingestion, chunking, and embedding. PDFs preserve visual layout rather than logical structure, Office files rely on completely different internal formats, and scanned documents require OCR before any text exists at all. Metadata is often incomplete or inconsistent, and small problems at this stage propagate downstream. If extraction quality is poor, retrieval becomes unreliable, and the language model begins to produce weak or misleading answers.
This is where Kreuzberg plays a central role, covering the entire early-stage data flow: document ingestion, text chunking, and embedding generation. A typical RAG pipeline can combine Kreuzberg for ingestion, chunking, and embeddings with LangChain as the orchestration layer, alongside a vector database and an LLM. While the architecture is fairly standard, the quality of the early steps determines everything that follows.
Embeddings are numerical vector representations of text. An embedding model converts a piece of text, such as a sentence, paragraph, or document, into a list of numbers that captures its semantic meaning. Texts with similar meanings end up close to each other in this high-dimensional vector space, making it possible to search by meaning rather than exact keywords. If you haven’t seen this before, the TensorFlow Embedding Projector is a useful way to visualize how embeddings cluster similar concepts together.
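The geometry behind this can be illustrated with plain cosine similarity. The vectors below are hand-made stand-ins for illustration, not output from a real embedding model:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made 4-dimensional "embeddings" (real models produce hundreds of dimensions)
cat    = [0.9, 0.8, 0.1, 0.0]
kitten = [0.8, 0.9, 0.2, 0.1]
car    = [0.1, 0.0, 0.9, 0.8]

print(cosine_similarity(cat, kitten))  # close to 1.0: similar meaning
print(cosine_similarity(cat, car))     # much lower: unrelated
```

Semantic search is exactly this comparison, run between a query vector and every stored chunk vector.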
Here are the steps to a RAG pipeline with Kreuzberg and LangChain:
- Extract text from a sample PDF and DOCX using Kreuzberg
- Inspect the raw output and metadata to understand what high-quality extraction looks like
- Chunk the text using a concrete strategy (recursive splitting with overlap) with Kreuzberg
- Generate embeddings with Kreuzberg and store them in a vector database such as Chroma or FAISS
- Wire everything together with LangChain and run a query end-to-end
In the examples, we'll use the Kreuzberg Python library.
Begin by installing the dependencies:

```shell
pip install kreuzberg langchain chromadb
```
Then, extract text from your documents:
```python
from kreuzberg import extract

# Extract from a PDF
pdf_result = extract("sample.pdf")

# Extract from a DOCX
docx_result = extract("sample.docx")

# Inspect the first 500 characters and the metadata
print(pdf_result.text[:500])
print(pdf_result.metadata)
```
At this stage, you receive:

- Clean extracted text
- Structured metadata
- Page-level and document-level information
After that, chunk the extracted text. Instead of manually splitting strings, use Kreuzberg’s built-in chunking configuration.
```python
from kreuzberg import extract, ChunkingConfig

result = extract(
    "sample.pdf",
    chunking=ChunkingConfig(
        strategy="recursive",
        chunk_size=500,
        chunk_overlap=50,
    ),
)

# Inspect the first three generated chunks
for chunk in result.chunks[:3]:
    print(chunk.content)
    print(chunk.metadata)
```
Next, generate embeddings with Kreuzberg:
```python
from kreuzberg import extract, ChunkingConfig, EmbeddingConfig

result = extract(
    "sample.pdf",
    chunking=ChunkingConfig(
        strategy="recursive",
        chunk_size=500,
        chunk_overlap=50,
    ),
    embedding=EmbeddingConfig(
        preset="sentence-transformers/all-MiniLM-L6-v2",
    ),
)

# Each chunk now carries an embedding vector
first_chunk = result.chunks[0]
print(len(first_chunk.embedding))  # vector dimension
```
Then, store the embeddings in a vector database, for example Chroma:
```python
import chromadb
from chromadb.config import Settings

client = chromadb.Client(Settings(anonymized_telemetry=False))
collection = client.create_collection("documents")

for chunk in result.chunks:
    collection.add(
        documents=[chunk.content],
        metadatas=[chunk.metadata],
        embeddings=[chunk.embedding],
        ids=[chunk.id],
    )
```
Finally, query with LangChain, which orchestrates retrieval and generation:
```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# Queries must be embedded with the same model used for the chunks
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vectorstore = Chroma(
    client=client,  # the chromadb client created above
    collection_name="documents",
    embedding_function=embeddings,
)
retriever = vectorstore.as_retriever()

llm = ChatOpenAI(model="gpt-4o-mini")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

response = qa_chain.invoke({"query": "What is this document about?"})
print(response["result"])
```
LangChain connects:

- The retriever (vector database)
- The prompt template
- The LLM
- The final response pipeline
## What You Just Built

You now have:

- Document ingestion (Kreuzberg)
- Structured chunking (Kreuzberg)
- Embedding generation (Kreuzberg)
- Vector storage (Chroma)
- Retrieval orchestration (LangChain)
- Answer synthesis (LLM)
This is a complete, end-to-end RAG pipeline, and a solid foundation to harden for production.
## Why Document Processing Can Be the Hardest Part of RAG
Many tutorials focus heavily on embeddings and prompting, but teams that deploy real systems quickly discover that data preparation is the bottleneck. Production pipelines must deal with complex layouts, multiple file formats, scanned documents, large batches, and multilingual content.
Kreuzberg is designed specifically for this layer. It transforms heterogeneous documents into clean, structured outputs that downstream systems can reliably use. In a typical RAG pipeline, Kreuzberg sits at the beginning, extracting text, structuring metadata, chunking content, and generating embeddings in a consistent and unified way.
A useful way to visualize the flow is as a sequence of transformations: documents are extracted, divided into smaller segments, converted into embeddings, stored in a vector database, retrieved in response to a query, and finally synthesized by a language model. Every stage depends on the quality of the one before it.
## The Architecture of a RAG Pipeline
Although implementations differ, most pipelines follow the same logical progression. Documents are first ingested and normalized. The extracted text is then split into chunks of manageable size, after which embeddings are generated and stored in a searchable index. When a user asks a question, the system retrieves the most relevant chunks and passes them to an LLM for synthesis.
One of the strengths of the RAG pattern is that each stage can be swapped independently. The ingestion engine, embedding model, database, and LLM can all be replaced without redesigning the entire system. Keeping these concerns separated makes pipelines easier to evolve.
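This separation of concerns can be made explicit in code. The sketch below uses illustrative interfaces of my own, not Kreuzberg's or LangChain's actual classes, to show how each stage hides behind a small contract so any implementation can be swapped in:

```python
from typing import Protocol

class Extractor(Protocol):
    def extract(self, path: str) -> str: ...

class Chunker(Protocol):
    def split(self, text: str) -> list[str]: ...

class Embedder(Protocol):
    def embed(self, chunks: list[str]) -> list[list[float]]: ...

class SentenceChunker:
    """One possible Chunker: naive split on periods."""
    def split(self, text: str) -> list[str]:
        return [s.strip() for s in text.split(".") if s.strip()]

def index_document(path: str, extractor: Extractor,
                   chunker: Chunker, embedder: Embedder):
    """The pipeline depends only on the interfaces,
    so any stage can be replaced independently."""
    text = extractor.extract(path)
    chunks = chunker.split(text)
    vectors = embedder.embed(chunks)
    return list(zip(chunks, vectors))
```

Swapping the embedding model or the ingestion engine then means passing a different object, with no change to the pipeline itself.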
## Extracting Text from Documents
The first stage is always extraction. In practice, this involves reading files in multiple formats, detecting whether text is embedded or must be recovered through OCR, and preserving structural or metadata information whenever possible.
After this step, the system has clean text, document metadata, and often page-level or structural information. This output becomes the foundation for everything that follows, and in Kreuzberg’s case, it directly feeds into chunking and embedding generation.
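The detection logic can be sketched as follows. The helpers here are hypothetical placeholders, not Kreuzberg's API: if a file yields an embedded text layer, use it directly; otherwise fall back to OCR.

```python
def read_text_layer(path: str) -> str:
    """Hypothetical placeholder: return the embedded text layer, or "" if none."""
    return ""  # pretend this is a scanned PDF with no text layer

def run_ocr(path: str) -> str:
    """Hypothetical placeholder for an OCR engine such as Tesseract."""
    return "text recovered from page images"

def extract_text(path: str) -> str:
    # Prefer the embedded text layer; only run OCR when the document is image-only.
    text = read_text_layer(path)
    if text.strip():
        return text
    return run_ocr(path)

print(extract_text("scanned.pdf"))
```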
## Chunking and Embeddings
Once text has been extracted, it must be divided into smaller segments. Large documents cannot be embedded or retrieved efficiently as a single block. The goal of chunking is not only to reduce size but also to preserve meaning. Splitting in the wrong place can destroy context and reduce retrieval accuracy.
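The role of overlap is easiest to see in a minimal character-level sketch. Real recursive splitters additionally prefer paragraph and sentence boundaries over hard cuts; this simplified sliding window shows only the overlap idea:

```python
def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Sliding-window split: each chunk repeats the last `overlap` characters
    of the previous one, so a sentence cut at a boundary keeps some context."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = split_with_overlap("abcdefghij" * 20, chunk_size=50, overlap=10)
print(len(chunks))
print(chunks[0][-10:] == chunks[1][:10])  # the overlap region is shared
```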
This step is especially critical because the embedding models used in RAG systems are designed to capture relationships across entire sequences of text: transformer-based encoders attend in both directions, so they understand context beyond individual tokens. The way text is chunked directly affects how well these relationships are preserved in the resulting embeddings.
After chunking, each segment is converted into a vector representation. At this point, each chunk becomes a structured record consisting of text, metadata, and an embedding vector. Kreuzberg handles both chunking and embedding generation, reducing complexity and ensuring consistency across the pipeline.
## Retrieval and Answer Generation
When a user submits a query, the pipeline converts it into an embedding and searches the vector database for similar entries. In practice, this means finding the chunks whose representations are closest to the query in semantic space.
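A minimal sketch of that search, using brute-force cosine similarity over an in-memory list (the vectors are hand-made stand-ins; real vector databases use approximate indexes such as HNSW to scale):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec: list[float], store, k: int = 2) -> list[str]:
    """store: list of (chunk_text, vector). Return the k closest chunks."""
    scored = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

store = [
    ("Invoices are due within 30 days.", [0.9, 0.1, 0.0]),
    ("Payment terms and late fees.",     [0.8, 0.2, 0.1]),
    ("Office dress code policy.",        [0.0, 0.1, 0.9]),
]
query = [0.85, 0.15, 0.05]  # stand-in for an embedded query about payments
print(top_k(query, store))
```

Only the top-k chunks are passed to the model, which is why the quality of this selection matters so much.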
Frameworks like LangChain orchestrate this process, connecting retrieval, prompting, and generation into a single workflow. They also make it possible to refine retrieval, for example, through filtering, ranking, or hybrid search, so that the most relevant context is passed to the language model.
An important detail is that the model never sees the entire dataset. It only receives a carefully selected subset of chunks. The quality of this selection determines the quality of the final answer.
## Scaling a RAG Pipeline
Once a pipeline works on a small dataset, real-world deployments introduce additional requirements. Ingestion must handle large volumes of files and often run in parallel. Retrieval systems benefit from metadata filtering and hybrid search strategies, and generation layers often include structured prompts or citation mechanisms.
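Parallel ingestion can be sketched with the standard library. `process_file` below is a placeholder for whatever per-document work you run; with Kreuzberg, you would call its extraction API inside it:

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path: str) -> dict:
    """Placeholder for per-document work: extract, chunk, embed."""
    return {"path": path, "status": "ok"}

paths = [f"doc_{i}.pdf" for i in range(8)]

# Threads suit I/O-bound extraction; prefer ProcessPoolExecutor for CPU-bound OCR.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_file, paths))

print(sum(r["status"] == "ok" for r in results))
```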
At scale, another challenge emerges: as data grows, it becomes increasingly difficult to understand or navigate the information at all. Large document collections quickly exceed what humans can manually organize or search effectively. This is exactly where RAG systems become so important: they make massive, unstructured datasets usable.
## Common Mistakes
One of the most frequent mistakes is treating ingestion as a trivial preprocessing step. Teams often invest heavily in prompt engineering while overlooking extraction quality, only to discover that retrieval accuracy is limited by poor source data. Inconsistent chunking and missing metadata create similar issues.
A good rule of thumb is to design this early stage carefully. Because extraction, chunking, and embedding happen at the beginning, mistakes here propagate forward. Poor extraction leads to weaker chunking, lower-quality embeddings, less accurate retrieval, and ultimately worse answers.
## Final Thoughts
RAG systems succeed or fail based on the quality of their data pipeline. Reliable document parsing, chunking, and consistent embedding generation form the foundation on which retrieval and generation depend.
Kreuzberg fits naturally into this architecture because it addresses the first part of the workflow: turning messy, real-world documents into clean, structured, and semantically meaningful data ready for retrieval and generation. LangChain provides the glue between components, letting you compose retrieval, prompts, and LLMs into a single, production-ready pipeline.
Don't hesitate to submit issues or make contributions to Kreuzberg on GitHub.