Most teams are busy optimizing prompts, but the silent bottleneck is poor ingestion
LLMs are only as good as what you feed them.
Make sure your team isn't feeding them junk.
The ingestion pipeline is unglamorous: it's the data processing layer between your documents and your vector database. Yet it is the biggest factor in whether your AI's answers are accurate, relevant, and cost efficient. Not your prompts, not your retrieval strategy, not your chosen model.
Fix ingestion first, or watch everything downstream fall apart.
Hidden Problems with Poor Ingestion
Improper ingestion of documents into VectorDBs leads to:
- Bad retrieval: irrelevant chunks or missing context
- High costs: thousands of unnecessary embeddings ("chunk explosion")
- Slow queries: poor indexing or overlapping data
- Stale knowledge: no versioning or re-embedding strategy
When ingestion is not optimized, you're baking major flaws into your AI solution:
- Every interaction costs more money than it should
- Hallucination rates go up
- AI quickly turns into an echo chamber
Classic Ingestion Demo Loop
```python
import os

from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Walk the docs folder and load everything we recognize.
docs = []
for root, _, files in os.walk(doc_dir):
    for file in files:
        path = os.path.join(root, file)
        if file.endswith((".txt", ".md")):
            docs.extend(TextLoader(path, encoding="utf-8").load())
        elif file.endswith(".pdf"):
            docs.extend(PyPDFLoader(path).load())

# Fixed-size chunking, then embed and persist to Chroma.
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=50)
splits = splitter.split_documents(docs) if docs else []

embeddings = HuggingFaceEmbeddings(model_name=settings.embeddings_model)
db = Chroma(persist_directory=chroma_dir, embedding_function=embeddings)
db.add_documents(splits)
db.persist()
```
This code snippet will work in a demo but fall apart in production.
What This Code Doesn't Handle
Deduplication
If you run this code twice, you'll have duplicate embeddings, and newly uploaded versions of a document will compete with older versions during retrieval.
VectorDBs (like Chroma, Pinecone, Qdrant) don't automatically detect duplicates. If you re-run ingestion on the same folder, every file gets re-embedded and appended. This creates multiple versions of the same content.
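A minimal way to guard against this is to derive a deterministic ID from each chunk's content and skip anything that's already been ingested, replacing the blind `db.add_documents(splits)` call above. Here's a rough sketch, assuming the `splits` and `db` objects from the snippet; the manifest file and helper names are illustrative, not part of any library.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical manifest recording which chunk hashes have already been embedded.
MANIFEST_PATH = Path("ingestion_manifest.json")

def chunk_id(doc) -> str:
    """Deterministic ID from source + content, so identical chunks always map to the same ID."""
    key = f"{doc.metadata.get('source', '')}::{doc.page_content}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def filter_new_chunks(splits):
    """Drop chunks whose content hash we've already ingested, and update the manifest."""
    seen = set(json.loads(MANIFEST_PATH.read_text())) if MANIFEST_PATH.exists() else set()
    fresh = [doc for doc in splits if chunk_id(doc) not in seen]
    seen.update(chunk_id(doc) for doc in fresh)
    MANIFEST_PATH.write_text(json.dumps(sorted(seen)))
    return fresh

new_splits = filter_new_chunks(splits)
if new_splits:
    db.add_documents(new_splits, ids=[chunk_id(d) for d in new_splits])
```

Depending on your vector store, stable IDs may also let a re-run overwrite a chunk instead of appending a second copy.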
Re-embedding Strategy
Without a tracking system, if your embedding model gets updated or your document content changes, there's no way to know what needs to be re-embedded or why.
Embeddings are model-dependent. If you switch from text-embedding-ada-002 to text-embedding-3-small, or even upgrade your HuggingFace model, old vectors are no longer compatible: vectors from different models live in different embedding spaces, so similarity scores against them stop being meaningful.
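One lightweight way to make this visible is to stamp every chunk with the model that produced its vector, then compare against the currently configured model on startup. A sketch, assuming the `splits`, `db`, and `settings.embeddings_model` objects from the snippet above; the staleness check is a convention of this example, not a built-in LangChain or Chroma feature.

```python
CURRENT_MODEL = settings.embeddings_model

# Record the embedding model on every chunk before it is added to the store.
for doc in splits:
    doc.metadata["embedding_model"] = CURRENT_MODEL

def needs_reembedding(stored_metadata: dict) -> bool:
    """A chunk is stale if it was embedded with a different model than the one configured now."""
    return stored_metadata.get("embedding_model") != CURRENT_MODEL

# On startup, flag chunks produced by an older model so their sources can be re-ingested.
stored = db.get(include=["metadatas"])
stale = [m for m in stored["metadatas"] if needs_reembedding(m or {})]
if stale:
    print(f"{len(stale)} chunks were embedded with a different model and should be re-embedded.")
```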
Metadata Versioning
Knowing which version of a document a chunk came from, when it was ingested, and who authored it is essential for judging whether the data is still valid.
Without metadata, the VectorDB turns into a black box. No one can explain why a chunk was retrieved, whether it's outdated, or who authored it, all of which is critical for compliance and debugging.
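The fix is cheap: stamp lineage fields onto every chunk at ingestion time. A minimal sketch, assuming the `splits` list from the snippet above; the field names and example values are hypothetical, so adapt them to whatever your audit process actually needs.

```python
from datetime import datetime, timezone

def attach_lineage(doc, version: str, author: str):
    """Stamp each chunk with the lineage fields needed for auditing and freshness checks."""
    doc.metadata.update({
        "doc_version": version,                      # e.g. a git tag or document revision
        "author": author,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "source": doc.metadata.get("source", "unknown"),
    })
    return doc

# Hypothetical values; in practice these would come from your document management system.
splits = [attach_lineage(doc, version="2024-06-01", author="docs-team") for doc in splits]
```

Because most vector stores can filter on metadata, these fields also make it possible to exclude outdated document versions at query time.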
Dynamic Chunk Sizing
Strict character limits fall apart when ingesting tables, code blocks, and lists. Semantic boundaries matter more than arbitrary character counts.
Fixed-length chunking ignores structure. If a chunk cuts through the middle of a code block, table, or conversation, the model cannot reconstruct the meaning, and retrieval produces incomplete or surface-level answers.
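One common approach for markdown sources is to split on headings first and only fall back to a size limit inside sections that are still too large. A sketch using LangChain's MarkdownHeaderTextSplitter alongside the size-based splitter from the original snippet; `markdown_text` is assumed to hold the raw text of a single markdown file.

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

# Split on semantic boundaries (headings) first...
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = header_splitter.split_text(markdown_text)   # one Document per section, headings kept as metadata

# ...then only enforce a character limit on sections that are still oversized.
size_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=50)
splits = size_splitter.split_documents(sections)
```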
Semantic Validation of Retrievals
It's vital to know whether chunking preserved meaning and whether the relevant information can actually be retrieved. Without tests, metrics, and a feedback loop, the project flies blind.
Consistent testing is required to confirm that the ingestion pipeline actually produced useful embeddings. The team needs to validate whether adding chunks is helping or hurting retrieval.
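Even a tiny evaluation set beats no evaluation. Here's a rough smoke test, assuming the `db` object from the ingestion snippet; the questions and expected source files are placeholders you'd replace with real examples from your domain.

```python
# Hypothetical evaluation cases: a question and the file its answer should come from.
eval_set = [
    {"question": "How do I rotate an API key?", "expected_source": "security.md"},
    {"question": "What is the refund policy?", "expected_source": "billing.md"},
]

retriever = db.as_retriever(search_kwargs={"k": 4})

hits = 0
for case in eval_set:
    results = retriever.invoke(case["question"])
    # Count a hit when any retrieved chunk comes from the expected source file.
    if any(case["expected_source"] in doc.metadata.get("source", "") for doc in results):
        hits += 1

print(f"retrieval hit rate: {hits}/{len(eval_set)}")
```

Run it after every ingestion change; a drop in hit rate tells you the new chunks are hurting rather than helping.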
Each of these needs to be considered in order to ensure quality and reliability in your AI assistant.
What Healthy Ingestion Unlocks
When ingestion is done right, you get more than a working demo:
- Retrieval accuracy: relevant, confident answers
- Cost efficiency: fewer embeddings and fewer tokens
- Performance: faster queries and smaller indexes
- Knowledge freshness: re-embedding keeps responses up to date
- Compliance readiness: traceable, auditable data lineage
The Healthy Ingestion Lifecycle
A well-designed ingestion pipeline typically follows this sequence:
- Extract: load docs, normalize text, attach metadata
- Chunk: split semantically, not blindly by size
- Embed: use a model fit for your domain
- Validate: test retrieval precision and recall with sample questions
- Version & Monitor: detect drift, re-embed, and track growth
Each stage can be measured with metrics like cost and accuracy
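As a starting point, even a crude per-run report catches chunk explosion before it shows up on your bill. A sketch, assuming the `splits` list from the snippet above; the price constant and the 4-characters-per-token heuristic are rough placeholders, not your provider's real numbers.

```python
# Hypothetical embedding price; substitute your provider's actual rate.
PRICE_PER_1K_TOKENS_USD = 0.0001

total_chars = sum(len(doc.page_content) for doc in splits)
approx_tokens = total_chars // 4          # rough heuristic: ~4 characters per token in English text

report = {
    "chunks": len(splits),
    "approx_tokens": approx_tokens,
    "approx_embedding_cost_usd": round(approx_tokens / 1000 * PRICE_PER_1K_TOKENS_USD, 4),
}
print(report)
```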
If your AI is behaving dumb, it may be because it has an ingestion problem.
AI is only as sharp as what it eats.
Healthy ingestion is like a healthy diet.
- Documents are ingredients
- Chunking is portion control
- Embeddings are digestion
When AI eats junk food, it gets bloated, sluggish, and confused.
Feed it clean and structured data, and it gets lean, responsive, and smart.