👉 “From theory to enterprise-ready architectures”
Introduction
In Chapter 2, we explored how far prompt engineering can take us: from zero-shot instructions to advanced reasoning strategies. These techniques maximize how we ask, but they cannot change what the model knows.
Even the best-designed prompts are constrained by training data cutoffs, context window limits, and the risk of hallucinations when models lack access to current or domain-specific information.
For scenarios requiring access to up-to-date knowledge, proprietary databases, or specialized domain expertise, we need a different approach. Retrieval-Augmented Generation (RAG) bridges this gap by connecting language models to external knowledge sources, enabling them to ground responses in current, verifiable information.
What is RAG (Retrieval-Augmented Generation)?
RAG is an approach that combines a generative model with an external information retrieval system. Instead of being limited to the knowledge learned during training, the model consults databases, documents, or APIs in real time and uses that information as context to generate more accurate, updated, and well-founded responses.
Main advantages:
- Avoids full retraining of the model.
- Allows incorporating recent or domain-specific data.
- Refreshes information that would otherwise be outdated or generic.
- Reduces the risk of hallucinations by grounding responses in verifiable sources.
RAG architectural patterns
The RAG architectural patterns describe the different ways of organizing the flow between retrieval and generation. Each one adapts to different levels of complexity, needs, and usage contexts.
Simple RAG
- ✏️ Definition: Retrieves documents based solely on the vector similarity between the query and the stored documents. It is the most basic form of RAG.
- 🎯 Best for: FAQ bots, simple document search, applications that do not need extended context.
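To make the flow concrete, here is a minimal, self-contained Python sketch of the Simple RAG loop. The `embed` function is a toy word-count placeholder (a real system would call an embedding model), and the final grounded prompt is what would be sent to an LLM:

```python
import math

# Toy embedding: a real system would call an embedding model instead.
def embed(text: str) -> list[float]:
    vocab = ["refund", "shipping", "invoice", "password", "account"]
    return [float(text.lower().count(w)) for w in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

documents = [
    "Refunds are processed within 5 business days.",
    "Shipping takes 3-7 days depending on the region.",
    "Reset your password from the account settings page.",
]
index = [(doc, embed(doc)) for doc in documents]  # the "vector store"

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

query = "How long does a refund take?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this grounded prompt is what gets sent to the LLM
```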
Simple RAG with Memory
- ✏️ Definition: Extends the Simple RAG by maintaining the context of previous interactions, which allows coherent responses throughout a conversation.
- 🛠 How it works: Stores conversation history and includes relevant previous exchanges in the retrieval process, enabling follow-up questions and contextual understanding across multiple turns.
- 🎯 Best for: Conversational chatbots, customer service applications, educational tutoring systems, personal assistants that need to remember user preferences and context.
- ⚠️ Avoid when: Building stateless applications, handling sensitive data where conversation history poses privacy risks, or when memory storage costs outweigh benefits.
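A minimal sketch of the memory pattern, assuming generic `retriever` and `llm` callables (both stubbed here). The key point is that the retrieval query and the prompt both include recent turns:

```python
from collections import deque

class ConversationalRAG:
    """Sketch: augment retrieval and generation with recent turns."""

    def __init__(self, retriever, llm, max_turns: int = 5):
        self.retriever = retriever               # callable: query -> list[str]
        self.llm = llm                           # callable: prompt -> str
        self.history = deque(maxlen=max_turns)   # bounded conversation memory

    def ask(self, question: str) -> str:
        # Include prior turns in the retrieval query so follow-ups
        # like "And Plan B?" resolve against the right documents.
        history_text = " ".join(f"{q} {a}" for q, a in self.history)
        docs = self.retriever(f"{history_text} {question}".strip())
        prompt = (
            f"Conversation so far:\n{history_text}\n\n"
            "Context:\n" + "\n".join(docs) +
            f"\n\nQuestion: {question}"
        )
        answer = self.llm(prompt)
        self.history.append((question, answer))
        return answer

# Stubbed dependencies, just to show the call pattern.
bot = ConversationalRAG(
    retriever=lambda q: ["Plan A costs $10/month.", "Plan B costs $25/month."],
    llm=lambda p: "(answer generated from the prompt)",
)
bot.ask("What does Plan A cost?")
print(bot.ask("And Plan B?"))  # follow-up works thanks to stored history
```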
Branched RAG
- ✏️ Definition: Dynamically determines which data source(s) to query according to the nature of the request, allowing specialized access to multiple repositories.
- 🛠 How it works: Uses query classification to route different types of questions to appropriate knowledge bases (technical docs vs. marketing materials vs. legal documents), then combines results intelligently.
- 🎯 Best for: Enterprise environments with multiple document types, multidomain research platforms, applications serving diverse user roles with different information needs.
- ⚠️ Avoid when: Working with homogeneous data sources, simple single-domain applications, or when the overhead of routing logic exceeds the benefits of specialization.
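A sketch of the routing idea with a deliberately crude keyword classifier; the domains and keywords below are purely illustrative, and production routers usually rely on a small trained classifier or an LLM call instead:

```python
# Illustrative domains and keywords for routing.
ROUTES = {
    "legal": ["contract", "liability", "compliance"],
    "technical": ["api", "error", "deploy"],
    "marketing": ["campaign", "brand", "audience"],
}

def classify(query: str) -> str:
    q = query.lower()
    scores = {domain: sum(kw in q for kw in kws) for domain, kws in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "technical"  # arbitrary default route

def branched_retrieve(query: str, retrievers: dict) -> list[str]:
    # Route the query to the knowledge base its classification picks.
    return retrievers[classify(query)](query)

retrievers = {
    "legal": lambda q: ["Clause 4.2 limits liability to fees paid."],
    "technical": lambda q: ["POST /v1/ingest accepts JSON payloads."],
    "marketing": lambda q: ["The Q3 campaign targets mid-market buyers."],
}
print(branched_retrieve("What does the contract say about liability?", retrievers))
```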
HyDE (Hypothetical Document Embeddings)
- ✏️ Definition: Generates a hypothetical document as the initial representation of the query and uses it as a vector to retrieve more relevant real documents.
- 🛠 How it works: Instead of searching with the user's question directly, first generates a hypothetical answer, then uses that generated text as the search query to find documents with similar content patterns.
- 🎯 Best for: Complex research questions, scientific literature search, ambiguous queries where the question vocabulary differs significantly from answer vocabulary, exploratory research.
- ⚠️ Avoid when: Dealing with factual lookup queries, working with well-structured FAQs, or when response speed is critical since it requires an additional generation step.
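A sketch of the HyDE flow, with `llm`, `embed`, and `search` as placeholder callables (the names are illustrative, not a specific library's API):

```python
def hyde_retrieve(question: str, llm, embed, search, k: int = 3) -> list[str]:
    # 1. Ask the model to draft a plausible (possibly wrong) answer.
    hypothetical = llm(f"Write a short passage that answers: {question}")
    # 2. Embed the hypothetical document, not the question, so the query
    #    vector lives in the same "answer space" as the corpus.
    vector = embed(hypothetical)
    # 3. Retrieve real documents near that vector.
    return search(vector, k)

# Stubbed dependencies to show the flow end to end.
docs = hyde_retrieve(
    "Why do lithium batteries degrade in cold weather?",
    llm=lambda p: "Low temperatures slow ion transport in the electrolyte ...",
    embed=lambda text: [float(len(text))],              # toy embedding
    search=lambda vec, k: ["(k real documents nearest to the vector)"],
)
print(docs)
```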
Corrective RAG (CRAG)
- ✏️ Definition: Evaluates and validates the retrieved documents, correcting or complementing the information with reliable sources before passing them to the LLM.
- 🛠 How it works: Includes a quality assessment step that scores retrieved documents for relevance and accuracy, filtering out low-quality results and potentially triggering additional searches for better sources.
- 🎯 Best for: High-stakes applications where accuracy is critical (medical, legal, financial advice), fact-checking systems, journalism, regulatory compliance documentation.
- ⚠️ Avoid when: Building casual information systems, working with time-sensitive applications where validation overhead is prohibitive, or when source documents are already highly curated.
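A sketch of the corrective step, using a toy word-overlap scorer in place of the trained relevance evaluator a real CRAG pipeline would use:

```python
def assess(query: str, doc: str) -> float:
    # Toy relevance score: fraction of query words present in the doc.
    words = set(query.lower().split())
    return sum(w in doc.lower() for w in words) / max(len(words), 1)

def corrective_retrieve(query: str, primary, fallback, threshold: float = 0.5):
    docs = primary(query)
    good = [d for d in docs if assess(query, d) >= threshold]
    if not good:
        # Nothing passed validation: trigger a broader or trusted search.
        good = fallback(query)
    return good

result = corrective_retrieve(
    "maximum daily dosage of ibuprofen for adults",
    primary=lambda q: ["Our office hours are 9 to 5."],   # irrelevant hit
    fallback=lambda q: ["Adults: maximum 1200 mg/day over the counter."],
)
print(result)  # the irrelevant hit was filtered out; the fallback answered
```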
Graph RAG
- ✏️ Definition: Converts the retrieved information into a knowledge graph that captures structured relationships and entities, facilitating deeper reasoning.
- 🛠 How it works:
  - Static Graph: Pre-constructs a knowledge graph from all available documents, storing entity relationships that can be queried directly.
  - Dynamic Graph: Builds graph connections on demand during query processing, creating contextual relationships specific to the current question.
- 🎯 Best for: Financial investigations (tracking company ownership chains), legal research (connecting precedents and statutes), scientific literature (linking research across studies), business intelligence (understanding supplier networks).
- ⚠️ Avoid when: Dealing with simple factual questions, working with unstructured creative content, handling queries that don't require relationship understanding, or when graph construction overhead isn't justified.
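A minimal sketch of graph-based retrieval over a hand-built static graph (a plain dictionary here; real deployments use a graph store such as Neo4j). The facts collected by the walk become the grounding context passed to the LLM:

```python
# A hand-built static graph: node -> [(relation, target)].
graph = {
    "AcmeCorp": [("owns", "AcmeSubsidiary"), ("supplied_by", "PartsInc")],
    "AcmeSubsidiary": [("owns", "ShellCo")],
    "PartsInc": [("located_in", "Taiwan")],
}

def neighborhood(entity: str, depth: int = 2) -> list[str]:
    """Collect relationship facts reachable within `depth` hops."""
    facts, frontier = [], [entity]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for relation, target in graph.get(node, []):
                facts.append(f"{node} {relation} {target}")
                next_frontier.append(target)
        frontier = next_frontier
    return facts

# "Who ultimately controls ShellCo's parent?" -> walk from the entity
# mentioned in the query; the facts become the LLM's grounding context.
print(neighborhood("AcmeCorp"))
```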
Adaptive RAG
- ✏️ Definition: Dynamically selects the most appropriate retrieval strategy according to the complexity of the query.
- 🛠 How it works: Uses a routing mechanism that analyzes query complexity and selects a retrieval strategy accordingly: simple questions use basic vector search, while complex multi-step questions might trigger graph traversal or multiple retrieval rounds.
- 🎯 Best for: General-purpose intelligent assistants, applications handling diverse query types, systems where user expertise levels vary significantly, adaptive educational platforms.
- ⚠️ Avoid when: Building specialized single-purpose applications, working with predictable query patterns, or when the complexity of multiple retrieval pipelines outweighs the benefits.
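A sketch of the routing mechanism with a deliberately crude complexity heuristic; real routers often ask a small classifier, or the LLM itself, to judge complexity:

```python
def complexity(query: str) -> str:
    # Crude heuristic: multi-hop cues or long queries count as complex.
    multi_hop = any(cue in query.lower() for cue in ["compare", "relationship", "why"])
    return "complex" if multi_hop or len(query.split()) > 15 else "simple"

def adaptive_retrieve(query: str, strategies: dict) -> list[str]:
    return strategies[complexity(query)](query)

strategies = {
    "simple": lambda q: ["(single vector-search pass)"],
    "complex": lambda q: ["(graph traversal or multiple retrieval rounds)"],
}
print(adaptive_retrieve("What is our refund policy?", strategies))
print(adaptive_retrieve("Compare supplier risk across our EU subsidiaries", strategies))
```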
Self RAG
- ✏️ Definition: The model self-evaluates and corrects its responses through internal reflection before delivering them.
- 🛠 How it works: After generating an initial response, the model critiques its own output, checks for consistency with retrieved sources, and may regenerate or refine the answer based on this self-assessment.
- 🎯 Best for: Applications where accuracy is critical and errors are costly (medical diagnosis support, financial analysis, legal document review), high-quality content generation, research assistance.
- ⚠️ Avoid when: Building real-time applications where response speed is critical, working with simple queries that don't benefit from self-reflection, or when computational costs of multiple generation cycles aren't justified.
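A sketch of the reflection loop, assuming a generic `llm` callable; the SUPPORTED/UNSUPPORTED protocol is an illustrative convention, not a standard API:

```python
def self_rag(question: str, context: str, llm, max_rounds: int = 2) -> str:
    answer = llm(f"Context:\n{context}\n\nQuestion: {question}")
    for _ in range(max_rounds):
        # Critique step: the model judges its own grounding.
        verdict = llm(
            "Does this answer follow strictly from the context? "
            "Reply SUPPORTED or UNSUPPORTED.\n"
            f"Context:\n{context}\nAnswer:\n{answer}"
        )
        if "UNSUPPORTED" not in verdict:
            break  # self-assessment passed; deliver the answer
        # Refinement step: regenerate constrained to the sources.
        answer = llm(
            "Rewrite the answer using ONLY the context.\n"
            f"Context:\n{context}\nPrevious answer:\n{answer}"
        )
    return answer
```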
Agentic RAG (Multi-Agent RAG)
- ✏️ Definition: Employs multiple autonomous agents with memory and reasoning to divide complex queries, consult diverse sources, and synthesize results.
- 🛠 How it works: Different specialized agents handle different aspects of complex queries: one might search technical documents, another financial data, and a coordinator agent synthesizes their findings into a coherent response.
- 🎯 Best for: Complex research workflows, strategic business decision support, multi-domain analysis requiring expert knowledge from different fields, automated analytical processes.
- ⚠️ Avoid when: Handling simple queries, working with limited computational resources, building applications where the complexity of agent coordination exceeds the query complexity, or when interpretability of the decision process is crucial.
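A heavily simplified sketch of the coordinator/specialist pattern, with each "agent" reduced to a plain function; real agents would carry their own tools, memory, and reasoning loops:

```python
def tech_agent(task: str) -> str:
    return f"[technical findings for: {task}]"

def finance_agent(task: str) -> str:
    return f"[financial findings for: {task}]"

def coordinator(query: str, llm) -> str:
    # 1. Decompose the query into per-domain subtasks (stubbed: same task).
    # 2. Fan out to the specialist agents.
    findings = [tech_agent(query), finance_agent(query)]
    # 3. Synthesize one coherent answer from their reports.
    return llm("Synthesize these findings into one answer:\n" + "\n".join(findings))

print(coordinator("Assess the risk of acquiring vendor X",
                  llm=lambda p: "(synthesized answer)"))
```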
What are the best practices when building a RAG system?
An effective RAG system requires a robust architecture that guarantees not only accurate responses but also security, data quality, and the ability to evolve. Below, we explore the fundamental practices that distinguish an experimental RAG from one ready for critical enterprise environments.
🔒 Security and compliance
Security in RAG systems goes beyond traditional encryption: it involves controlling what information can be indexed, who can access which responses, and how to guarantee regulatory compliance.
- Indexing control: not every document should enter the vector store (example: sensitive or confidential data).
- Contextual authorization: the same RAG can deliver different responses depending on the user profile.
- Encryption and regulations: apply standards such as GDPR, HIPAA, or other local regulations if personal or sensitive data are handled.
- Access auditing: record who consulted what information.
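A sketch of contextual authorization: filter retrieved documents by the caller's role before they ever reach the prompt. The roles and ACL tags below are illustrative; in practice the filter is applied as metadata constraints inside the vector-store query:

```python
# Illustrative documents tagged with access-control groups.
DOCS = [
    {"text": "Public pricing sheet.", "acl": {"public"}},
    {"text": "Internal salary bands.", "acl": {"hr"}},
    {"text": "Incident postmortem.", "acl": {"engineering", "hr"}},
]

ROLE_GROUPS = {"customer": {"public"}, "hr_manager": {"public", "hr"}}

def authorized_retrieve(query: str, role: str) -> list[str]:
    groups = ROLE_GROUPS.get(role, {"public"})
    # Keep only documents whose ACL intersects the caller's groups.
    return [d["text"] for d in DOCS if d["acl"] & groups]

print(authorized_retrieve("compensation", "customer"))    # public docs only
print(authorized_retrieve("compensation", "hr_manager"))  # HR docs included
```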
📊 Data quality
The quality of the data directly determines the usefulness of the system. Dirty data, incorrectly fragmented data, or outdated data will generate inaccurate responses, regardless of the sophistication of the model used.
- Prior curation: clean duplicated or noisy documents.
- Smart chunking: divide into semantic fragments, not arbitrary ones (paragraphs or complete sections instead of cuts by fixed tokens).
- Periodic updating: refresh embeddings when the data changes.
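As an example of semantic chunking, here is a sketch that groups whole paragraphs under a size cap instead of cutting every N tokens mid-sentence:

```python
def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Group whole paragraphs into chunks, never splitting mid-sentence."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk when adding this paragraph would overflow.
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = ("First paragraph about refunds.\n\n"
       "Second paragraph about shipping.\n\n"
       "Third paragraph about returns.")
print(chunk_by_paragraph(doc, max_chars=60))
```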
🗂️ Data versioning
Data versioning is fundamental because it is not only the model that matters, but also which external data were in effect at any given moment.
- Tag versions of documents, embeddings, and indexes.
- Maintain a change history (what was added, deleted, or modified).
- Ensure reproducibility: be able to reconstruct how a response was generated.
- Facilitate audits, traceability, and regulatory compliance.
👉 In regulated sectors (finance, health, legal), versioning makes it possible to demonstrate “what the system knew” when it responded.
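A sketch of the version metadata that might be attached to every indexed chunk; the field and model names are illustrative. Storing this record alongside each vector, and logging it with each response, is what makes “what the system knew” reconstructible:

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ChunkRecord:
    doc_id: str
    content: str
    doc_version: str      # revision of the source document
    embedding_model: str  # embeddings must be rebuilt if this changes
    indexed_at: str
    content_hash: str

def make_record(doc_id: str, content: str, doc_version: str) -> ChunkRecord:
    return ChunkRecord(
        doc_id=doc_id,
        content=content,
        doc_version=doc_version,
        embedding_model="embed-model-v2",  # illustrative model name
        indexed_at=datetime.now(timezone.utc).isoformat(),
        content_hash=hashlib.sha256(content.encode()).hexdigest()[:12],
    )

record = make_record("policy-017", "Refunds within 5 business days.", "v3.1")
print(record)  # stored with the vector and logged with every response
```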
🎯 Retrieval optimization
The quality of the responses depends on how well the system retrieves the most relevant documents. A poorly configured retrieval can generate inaccurate or irrelevant responses, even with the best LLM.
- Adjust the embeddings strategy (example: specialized models for code, health, finance).
- Use semantic re-ranking to improve the relevance of retrieved documents.
- Define an optimal k (number of documents returned) to avoid noise or lack of context.
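A sketch of the two-stage retrieve-then-rerank pattern: fetch a generous candidate set, then keep a small final k after finer scoring. The toy word-overlap scorer stands in for the cross-encoder re-rankers used in production:

```python
def rerank(query: str, candidates: list[str], final_k: int = 3) -> list[str]:
    def score(doc: str) -> int:
        # Toy relevance: count query words present in the document.
        return sum(w in doc.lower() for w in set(query.lower().split()))
    return sorted(candidates, key=score, reverse=True)[:final_k]

# Stage 1 would fetch a generous candidate set (e.g., top 20 by vector
# similarity); stage 2 keeps only the best few after finer scoring.
candidates = [
    "Latency SLOs for the ingestion API.",
    "Holiday schedule for 2025.",
    "How to tune ingestion API throughput.",
    "Office seating chart.",
]
print(rerank("ingestion api tuning guide", candidates, final_k=2))
```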
🔄 Observability and monitoring
A RAG system in production requires full visibility of its behavior to detect problems before they affect end users.
- Record queries and responses to detect failures.
- Measure metrics such as recall, precision, latency, and user satisfaction.
- Monitor data drift (when embeddings stop properly representing the domain).
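A sketch of per-query logging, writing one JSON line per interaction so recall, latency, and drift can be analyzed offline; the field names are illustrative:

```python
import json, time

def logged_ask(query: str, retrieve, llm, log_file: str = "rag_log.jsonl") -> str:
    t0 = time.perf_counter()
    docs = retrieve(query)                      # [{"id": ..., "score": ...}]
    answer = llm(query, docs)
    entry = {
        "ts": time.time(),
        "query": query,
        "retrieved_ids": [d["id"] for d in docs],
        "top_score": max((d["score"] for d in docs), default=None),
        "latency_ms": round((time.perf_counter() - t0) * 1000, 1),
        "answer": answer,
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(entry) + "\n")       # one JSON line per query
    return answer

answer = logged_ask(
    "What is the refund window?",
    retrieve=lambda q: [{"id": "policy-017", "score": 0.91}],
    llm=lambda q, docs: "(answer grounded in the retrieved documents)",
)
```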
📈 Data updating
A RAG system is not static: documents change, knowledge bases grow, and information can become obsolete. Keeping the data fresh is essential.
- Periodic updates: reprocess documents and regenerate embeddings at defined intervals (daily, weekly, or monthly, depending on the domain).
- Real-time ingestion: in critical cases (news, stock market, security alerts), the pipeline must allow immediate indexing of new data.
- Removal of obsolete data: delete or deactivate expired versions to prevent the LLM from retrieving invalid information.
- Synchronization with source systems: connect APIs, databases, or CMS where the content comes from.
- Automation: use update jobs (batch or streaming) to reduce manual work.
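A sketch of an incremental refresh job along these lines: re-embed only documents whose content hash changed, and deactivate chunks whose source disappeared. The `embed` callable is a toy placeholder:

```python
import hashlib

def refresh(source_docs: dict[str, str], index: dict[str, dict], embed) -> dict:
    seen = set()
    for doc_id, text in source_docs.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        seen.add(doc_id)
        # Re-embed only new documents or those whose content changed.
        if doc_id not in index or index[doc_id]["hash"] != digest:
            index[doc_id] = {"hash": digest, "vector": embed(text), "active": True}
    for doc_id in index:
        if doc_id not in seen:
            index[doc_id]["active"] = False  # retire obsolete data
    return index

index = refresh(
    {"policy-017": "Refunds within 5 business days."},
    index={},
    embed=lambda text: [float(len(text))],  # toy embedding
)
print(index)
```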
🔮 What’s Next?
Once we understand how to build RAG systems and apply best practices, the next challenge is to evaluate their performance and adapt models efficiently. You can continue with the next chapter in this series: GenAI Foundations – Chapter 4: Model Customization & Evaluation – Can We Trust the Outputs?
📖 Series Overview
You can find the entire series on my Profile:
- ✏️ GenAI Foundations – Chapter 1: Prompt Basics – From Theory to Practice
- 🧩 GenAI Foundations – Chapter 2: Prompt Engineering in Action – Unlocking Better AI Responses
- 📚 GenAI Foundations – Chapter 3: RAG Patterns – Building Smarter AI Systems
- ✅ GenAI Foundations – Chapter 4: Model Customization & Evaluation – Can We Trust the Outputs?
- 🗂️ GenAI Foundations – Chapter 5: AI Project Planning – The Generative AI Canvas