My team built a RAG agent on only our codebase and Confluence docs, and it quickly became clear that the "magic" was not in calling the LLM, but in getting embeddings, chunking, and prompts to behave reliably. This post shares that experience as a practical guide to internal-only RAG.
Context: RAG Without the Internet
In this project, the goal was to build a RAG agent that answered questions only from two internal sources:
- Markdown files in the monorepo (architecture docs, READMEs, ADRs, service guides).
- Confluence pages with product and system documentation.
External web data was intentionally excluded so the agent would not mix internal terminology with conflicting public definitions (e.g., "shell", "patterns", and "components" meaning different things internally vs. online).
This strict boundary is common in enterprise settings: legal, compliance, and domain specificity often require “sealed” knowledge bases instead of open‑web RAG.
Architecture at a Glance
The system followed a fairly standard RAG structure, but the constraints made every design choice matter more:
- Sources
  - Git repo markdown files (e.g., /services/billing/README.md, /docs/architecture/adr-004.md).
  - Confluence pages (ADRs, framing docs, general documentation).
- Ingestion & preprocessing
  - Loaders for MD files and Confluence pages.
  - Chunking into sections (initially naïve, later more structure-aware).
- Indexing
  - Embedding each chunk into a vector store.
  - Storing metadata: repo path, service name, Confluence space, page title, headings.
- Query pipeline
  - User question → retrieve top-k chunks → construct prompt → call LLM → answer restricted to retrieved context.
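To make that query pipeline concrete, here is a minimal Python sketch of the path from question to answer. The embedding function, vector store, and LLM call are passed in as placeholders; none of the names below come from our actual stack.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str    # "repo" or "confluence"
    path: str      # repo path or Confluence page title
    heading: str

def answer_question(question: str, embed, vector_store, call_llm, k: int = 5) -> str:
    """Retrieve top-k internal chunks and answer strictly from them."""
    query_vector = embed(question)
    chunks: list[Chunk] = vector_store.search(query_vector, top_k=k)

    # Present each chunk with its provenance so the model can cite it.
    context = "\n\n".join(
        f"[{c.source}:{c.path} > {c.heading}]\n{c.text}" for c in chunks
    )
    prompt = (
        "Use only the information in the provided documents to answer.\n"
        "If the answer is not in the documents, reply: "
        "\"I don't know based on our documentation.\"\n\n"
        f"Documents:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```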
On paper, this looked like “just another RAG setup.” In practice, two things made it much harder: small changes in embeddings and small changes in prompts produced dramatically different behavior.
Problem #1: Embeddings Felt Mysterious and Fragile
Once the initial pipeline was running, tweaking the embedding setup led to surprisingly large differences in results:
- Changing the embedding model changed which chunks were retrieved for the same query.
- Adjusting chunk size or overlap shifted context in or out of the retrieved set.
- Normalization or filtering decisions sometimes made the agent feel "too fuzzy" (semantically related but the wrong section) or "too literal" (ignoring relevant text that was phrased slightly differently).
From a developer’s perspective, it often felt like “there’s so much under the hood I don’t know about.” A minor config change could silently change behavior across the system, with no obvious signal that anything had changed.
Why this happens (conceptually)
Even without reproducing formal research, it helps to internalize a few principles:
- Embeddings encode similarity in a high-dimensional space; changing models or text preprocessing rearranges that space.
- Chunk boundaries matter: if a definition is split across chunks, a query may retrieve only half the context, making answers vague or wrong.
- Without a small evaluation set, it is easy to mistake random changes in ranking for "improvements."
Instead of treating embeddings as a black box, the lesson is to treat them as a design surface that must be tested and versioned like code.
Problem #2: Prompt Instructions Were Shockingly Sensitive
The second major pain point was prompt sensitivity:
- Slight rewrites of the system prompt ("answer strictly from context" vs. "use the following information as a primary source") changed how aggressively the model hallucinated or interpolated missing details.
- Minor tweaks to how the retrieved context was formatted (e.g., bullet lists vs. plain text, including or excluding file paths) led to a different focus in answers.
- Sometimes the model ignored obviously relevant retrieved chunks because the prompt framing didn't clearly instruct it how to use them.
This matched a pattern emerging in many RAG efforts: LLM behavior is not only a function of retrieved content, but also of how that content is presented and constrained in the prompt.
Without a clear evaluation, these changes felt like guesswork: change a sentence, run a query, hope for the best.
Design Principle #1: Make the Knowledge Boundary Explicit
The first key insight from this project is that “no external data” is a feature, not a bug, and the entire pipeline should be designed around that constraint.
Practical moves you can apply
- Encode the boundary in the system prompt:
  - Tell the model explicitly that internal definitions override any general knowledge.
  - Instruct it to respond with "I don't know" if the answer is not present in the retrieved docs.
Example (pseudo‑prompt):
You are an internal assistant for the ACME engineering team.
Use only the information in the provided documents to answer.
Our internal definitions may differ from the internet.
If the answer is not in the documents, reply: “I don’t know based on our documentation.”
- Ensure chunk metadata reinforces the boundary:
  - Include fields like source=confluence or source=repo, service=billing, space=Engineering, so you can filter or rerank based on these internal structures.
This explicitly encodes what the team was informally trying to achieve: prevent cross‑contamination between the open web and internal domain language.
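As a sketch of what that metadata might look like per chunk (the field names and the boundary check are illustrative, not a required schema):

```python
# Hypothetical metadata attached to each indexed chunk; field names are
# illustrative, not a prescribed schema.
repo_chunk_metadata = {
    "source": "repo",
    "service": "billing",
    "path": "/services/billing/README.md",
    "heading": "Retry policy",
}

confluence_chunk_metadata = {
    "source": "confluence",
    "space": "Engineering",
    "page_title": "ADR-004",
    "heading": "Decision",
}

def allowed(chunk_metadata: dict) -> bool:
    """Enforce the knowledge boundary at query time: internal sources only."""
    return chunk_metadata.get("source") in {"repo", "confluence"}
```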
Design Principle #2: Treat Chunking and Embeddings as a Product Feature
The second insight is that chunking and embeddings should be treated as first‑class design decisions, not low‑level implementation details.
Better chunking for MD + Confluence
For markdown and Confluence, structure‑aware chunking can dramatically increase retrieval quality:
- Chunk by headings and sections, not arbitrary token windows.
- Keep code blocks with the explanatory text that references them.
- Use overlapping windows to avoid cutting definitions or procedures in half.
Concretely:
- For narrative docs and ADRs: 400–800 token chunks with 10–15% overlap.
- For API or configuration docs: smaller chunks (200–400 tokens) grouped by heading or endpoint.
This makes retrieval more predictable: queries align with logical sections (e.g., “component design conventions”) instead of random text windows.
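A rough sketch of heading-aware chunking for markdown, using word counts as a stand-in for tokens. The sizes and the splitting rule are assumptions to tune, not the exact implementation we used.

```python
import re

def chunk_markdown(text: str, max_words: int = 500, overlap_words: int = 60) -> list[dict]:
    """Split a markdown document into heading-aligned chunks.

    Sections are cut at '#'-style headings, and long sections are further
    split with a small overlap so definitions aren't cut in half.
    """
    # Split at positions just before headings, keeping each heading with its section.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        heading = section.splitlines()[0].strip()  # first line (the heading, when present)
        words = section.split()
        start = 0
        while start < len(words):
            end = min(start + max_words, len(words))
            chunks.append({"heading": heading, "text": " ".join(words[start:end])})
            if end == len(words):
                break
            start = end - overlap_words  # overlap with the previous window
    return chunks
```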
Stabilize embeddings with simple rules
Instead of constantly tweaking the embedding function ad hoc:
- Pick a single embedding model as a default baseline and stick to it until you have evidence that you need to change it.
- Standardize your preprocessing:
  - Normalize whitespace.
  - Strip obvious boilerplate (navigation, repeated headers, footers).
  - Preserve headings and code fences.
- Version your embedding index:
  - Treat "embedding config v1.0.0", "v1.0.1", and so on like database migrations.
  - When you change the model or chunking, treat it as a new index, test it against a fixed evaluation set, and only then switch over.
This converts “tiny change, huge surprise” into “deliberate update with a predictable impact.”
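One way to make that concrete is a single config record plus a standardized preprocessing function, hashed into an index name so every change produces a distinct, comparable index. The structure and field names below are assumptions, not a prescribed format.

```python
import hashlib
import json
import re

# One record per index build; changing any field means building a new index
# and evaluating it before switching over.
EMBEDDING_CONFIG = {
    "version": "1.0.1",
    "model": "your-embedding-model-of-choice",   # placeholder name
    "chunk_max_words": 500,
    "chunk_overlap_words": 60,
    "preprocessing": ["normalize_whitespace", "strip_boilerplate"],
}

def preprocess(text: str) -> str:
    """Apply the standardized cleanup steps before embedding."""
    text = re.sub(r"[ \t]+", " ", text)      # normalize whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    # Boilerplate stripping (nav, repeated footers) would go here; headings
    # and code fences are deliberately left intact.
    return text.strip()

def config_fingerprint(config: dict) -> str:
    """Stable hash used to name the index, e.g. 'chunks_1f3a9c2b'."""
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:8]

print("index name:", f"chunks_{config_fingerprint(EMBEDDING_CONFIG)}")
```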
Design Principle #3: Build a Tiny, Ruthless Evaluation Set
The biggest regret in many RAG projects is not starting with a small, curated evaluation set. In this project, the team felt the impact of changes but didn’t have a systematic way to measure them.
To avoid that:
- Collect 10–30 real internal questions from engineers, product, and support.
- For each question, manually identify:
  - The key Confluence pages and MD files that should appear in the top-k results.
  - A sample "good enough" answer.
Then, whenever you change:
- Embedding model or parameters.
- Chunking strategy.
- Prompt template.
Run the eval set and track:
- How often the correct doc appears in the top-k (retrieval hit rate).
- Whether the answer is correct, partially correct, incorrect, or "I don't know."
Even a spreadsheet with “before vs after” is enough to turn guesswork into data‑driven iteration. This is where thought leadership comes from: not “here’s a pretty diagram,” but “here’s how we actually validated that our RAG changes helped.”
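A minimal harness for the retrieval half of that tracking might look like the sketch below; `retrieve` is a placeholder for your own search function, and the example question and path are hypothetical.

```python
import csv

# Hypothetical evaluation set: real internal questions plus the docs that
# should show up in the top-k results.
EVAL_SET = [
    {
        "question": "How do we roll out a feature flag to one service?",
        "expected_docs": ["/docs/feature-flags.md"],
    },
    # ... 10-30 of these
]

def retrieval_hit_rate(retrieve, k: int = 5) -> float:
    """Fraction of questions where at least one expected doc is in the top-k.

    `retrieve(question, k)` is a placeholder returning a list of doc paths.
    """
    hits = 0
    rows = []
    for case in EVAL_SET:
        retrieved = retrieve(case["question"], k)
        hit = any(doc in retrieved for doc in case["expected_docs"])
        hits += hit
        rows.append({"question": case["question"], "hit": hit})

    # "Before vs after": write one CSV per embedding config or prompt version.
    with open("eval_results.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["question", "hit"])
        writer.writeheader()
        writer.writerows(rows)

    return hits / len(EVAL_SET)
```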
Design Principle #4: Make Prompts Boring and Versioned
Prompt experimentation can become chaotic quickly. The key is to make prompts boring, consistent, and versioned.
Practical approach
- Define a small number of prompt templates:
  - prompt_v1: initial template.
  - prompt_v2: stricter "use only context" version.
  - prompt_v3: version tuned for debugging ("cite file paths and headings").
- Log which prompt version is used for each query.
- When exploring new instructions:
  - Test them against the evaluation set.
  - Compare outputs and metrics before making the new prompt the default.
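A minimal sketch of what "boring and versioned" can look like in code; the template wording is illustrative. The point is that the version travels with every rendered prompt so it can be logged alongside the query.

```python
# Versioned prompt templates; switching the default is an explicit,
# logged decision rather than an in-place edit.
PROMPTS = {
    "prompt_v1": (
        "Answer the user's question based on the following context.\n\n"
        "{context}\n\nQuestion: {question}"
    ),
    "prompt_v2": (
        "Use only the information in the provided documents to answer. "
        "Our internal definitions may differ from the internet. "
        "If the answer is not in the documents, reply: "
        "\"I don't know based on our documentation.\"\n\n"
        "{context}\n\nQuestion: {question}"
    ),
    "prompt_v3": (
        "Use only the provided documents. Cite the file path or Confluence "
        "page title for every claim, and say \"I don't know based on our "
        "documentation.\" when the documents don't cover the question.\n\n"
        "{context}\n\nQuestion: {question}"
    ),
}

DEFAULT_PROMPT_VERSION = "prompt_v2"

def build_prompt(question: str, context: str,
                 version: str = DEFAULT_PROMPT_VERSION) -> tuple[str, str]:
    """Return the rendered prompt and its version, so the version can be logged."""
    prompt = PROMPTS[version].format(context=context, question=question)
    return prompt, version
```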
Example upgrade path:
- Initial prompt: "Answer the user's question based on the following context."
- Improved prompt:
  - Explicitly call out internal definitions and the "I don't know" behavior.
  - Ask the model to highlight uncertainties.
- Expert prompt:
  - Ask the model to reference specific file paths or Confluence pages where it found the answer.
This encourages answers tied to concrete sources and helps developers debug retrieval issues.
This workflow creates a culture where prompts are artifacts that can be rolled forward and back, not one-off experiments that silently alter system behavior.
Design Principle #5: Use Structure and Metadata, Not Just Vectors
In internal codebases and Confluence, the structure often reflects the domain model better than any embedding:
- Services: billing-service, auth-service, notifications-service.
- Confluence spaces: "Engineering", "Product".
- Tags or labels: feature-flag, incident, ADR, docs.
Using this structure in retrieval can make the system more robust:
- Filter by service when the query includes a service name.
- Boost documents marked as ADRs when the query sounds like a design decision.
- Prefer incident postmortems when the query references "outage", "incident", or "postmortem".
This hybrid of metadata filtering + semantic similarity often outperforms pure embeddings, especially when terminology is overloaded internally.
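One way to combine the two is sketched below: take the vector-search candidates and nudge their scores using metadata. The labels, term lists, and boost values are assumptions to tune against your evaluation set, not settled numbers.

```python
INCIDENT_TERMS = {"outage", "incident", "postmortem"}
DESIGN_TERMS = {"decision", "why", "trade-off", "adr"}

def hybrid_rerank(question: str, candidates: list[dict]) -> list[dict]:
    """Rerank vector-search candidates using internal structure.

    Each candidate is assumed to be a dict with a similarity 'score' plus
    metadata such as 'service' and 'labels'.
    """
    words = set(question.lower().split())

    def adjusted(c: dict) -> float:
        score = c["score"]
        if c.get("service") and c["service"] in words:
            score += 0.2   # the query names the service this chunk belongs to
        labels = set(c.get("labels", []))
        if "adr" in labels and words & DESIGN_TERMS:
            score += 0.1   # design-decision questions prefer ADRs
        if "incident" in labels and words & INCIDENT_TERMS:
            score += 0.1   # incident vocabulary prefers postmortems
        return score

    return sorted(candidates, key=adjusted, reverse=True)
```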
Lessons Learned from Building Internal RAG
From this project, several lessons stand out that are broadly applicable:
- Control of the knowledge boundary is strategic. Saying "only MD + Confluence" forces clarity about definitions and guards against subtle domain drift.
- Embeddings and chunking are not plumbing. They are product-defining choices that deserve design, metrics, and versioning.
- Prompt changes must be testable. Treat prompts like code: version them, evaluate them, and roll back when needed.
- A small evaluation set is worth more than endless manual testing. Ten good questions with expected docs can save weeks of blind tweaking.
- Structure beats cleverness. Using repo paths, spaces, and tags often yields more reliable retrieval than yet another embedding model change.