DEV Community: Boussaden Taha

RAG - Complete Practical Guide

Boussaden Taha — Sat, 16 May 2026 11:08:49 +0000

Introduction

Retrieval-Augmented Generation (RAG) is one of the biggest pillars of today's AI field. It is mainly used by large companies to manage and retrieve their internal documents.
In this article I will explain some RAG concepts with code snippets for a better grasp, talk about some common problems I faced when implementing my own RAG system, and present some solutions along the way.

What is RAG?

RAG (Retrieval Augmented Generation) is a system design pattern that combines:

Information retrieval (finding relevant knowledge)
Large Language Models (LLMs) (generating responses)

Instead of relying only on what the model learned during training, a RAG system retrieves external knowledge and injects it into the prompt.

Traditional LLM

Question
   ↓
Model Memory (Training Data)
   ↓
Answer

Problems:

knowledge can be outdated
hallucinations happen
cannot access private company data

RAG based LLM

Question
   ↓
Retrieve Relevant Knowledge
   ↓
Add Context to Prompt
   ↓
LLM Generates Grounded Answer

This makes answers:

more accurate
grounded in documents
customizable
domain-specific

Why RAG?

LLMs are powerful but limited.

Common problems:

1. Hallucinations

The model invents facts.

Example:

Question:
Who founded Company X?

Answer:
John Smith.

Even if John Smith never existed.

2. Knowledge Cutoff

Models only know what they were trained on.

They do not automatically know:

your PDFs
internal documentation
GitHub repositories
recent updates

3. Private Data

Businesses need AI over:

internal docs
policies
tickets
codebases

RAG solves this.

Core Architecture

A RAG system usually contains:

Documents
Chunking system
Embedding model
Vector database
Retriever
Prompt constructor
LLM

Architecture:

Documents
   ↓
Chunking
   ↓
Embeddings
   ↓
Vector Database

User Question
   ↓
Question Embedding
   ↓
Similarity Search
   ↓
Relevant Chunks
   ↓
Prompt Construction
   ↓
LLM
   ↓
Answer

How RAG Works Step by Step

1. Documents

The system starts with raw documents.

Examples:

TXT files
PDFs
Markdown files
HTML pages
GitHub repos

Example text:

RAG systems use vector databases to retrieve
relevant information for LLMs.

2. Chunking

Documents are split into smaller sections.

Why?

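Embedding an entire book as a single vector is ineffective: embedding models have an input length limit, and one vector for a whole document blurs many topics together.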

Instead:

Large Document
   ↓
Small Chunks

Example:

Chunk 1 → Intro
Chunk 2 → Embeddings
Chunk 3 → Pinecone

3. Embeddings

Every chunk becomes a vector.

Example:

"RAG systems use retrieval"

becomes:

[0.12, -0.77, 0.48, ...]

4. Store in Vector Database

Vectors are stored in:

Pinecone
Weaviate
Qdrant
Chroma
FAISS

5. User Question

Example:

What are embeddings?

Question becomes a vector too.

6. Similarity Search

The vector database finds:

Most similar chunks

based on mathematical similarity.

7. Prompt Construction

Retrieved chunks are injected into the prompt.

Example:

Context:
Embeddings are vector representations.

Question:
What are embeddings?

8. LLM Generation

The LLM generates an answer using retrieved context.

Key Concepts and Definitions

1. Embedding

A numerical semantic representation of text.

Example:

"Machine learning"
↓
[0.12, -0.34, ...]

Purpose:

semantic understanding
similarity search

2. Vector

An ordered list of numbers.

Example:

[0.12, -0.55, 0.91]

3. Dimension

The number of values inside a vector.

Example:

768-dimensional vector

means:

768 numbers

Why it matters:

The dimension of your vector DB index must match the dimension of your embedding model's output.

Example:

nomic-embed-text → 768
Pinecone index → must be 768
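A mismatch usually only shows up as an error when you try to store vectors, so it can help to check it up front. Here is a minimal sketch, assuming the index dimension is known as a plain integer. The names `INDEX_DIMENSION` and `check_dimension` are my own illustration and not part of any library.

```python
INDEX_DIMENSION = 768  # assumed: the dimension the vector index was created with


def check_dimension(embedding, expected=INDEX_DIMENSION):
    """Raise early if an embedding does not fit the index."""
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} values, index expects {expected}"
        )
    return embedding


# A fake 768-value vector passes; a 384-value one would raise ValueError.
check_dimension([0.0] * 768)
```

Calling this right after generating an embedding makes the failure point obvious when you switch embedding models.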

4. Semantic Search

Search by meaning.

Not exact keywords.

Example:

Question:

How does memory work?

Can retrieve:

Agents retain context using memory systems.

5. Similarity Score

Measures closeness between vectors.

Higher score:

More relevant

Top-K

How many results to retrieve.

Example:

top_k=5

Means:

Return best 5 chunks
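The vector database does this selection internally, but the idea can be sketched in a few lines of plain Python. This assumes each chunk already has a similarity score; the scores and the `top_k` helper are made up for illustration.

```python
import heapq


def top_k(scored_chunks, k=5):
    """Return the k (score, chunk) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])


# Made-up similarity scores for illustration only.
scored = [(0.12, "chunk A"), (0.91, "chunk B"), (0.47, "chunk C")]
print(top_k(scored, k=2))  # [(0.91, 'chunk B'), (0.47, 'chunk C')]
```

A larger `k` gives the LLM more context but also more noise, so it is worth tuning rather than leaving at a default.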

6. Metadata

Extra information attached to vectors.

Example:

{
  "text": "Embeddings are vectors",
  "source": "notes.txt",
  "topic": "rag"
}

Embeddings Explained

Embeddings convert text into mathematical meaning.

Texts with similar meanings end up close together.

Example:

"How to build AI agents"

and

"Creating autonomous agents"

become nearby vectors.

Generating Embeddings with Ollama

import ollama


def generate_embedding(text):
    response = ollama.embeddings(
        model="nomic-embed-text",
        prompt=text
    )

    return response["embedding"]

Test:

embedding = generate_embedding(
    "What is RAG?"
)

print(len(embedding))
print(embedding[:10])

The code snippets above are from a RAG project I implemented; you can view the source code here.

Vector Databases

A vector database stores embeddings.

Traditional DB:

Search by exact values

Vector DB:

Search by similarity

Common vector DBs:

Pinecone
Qdrant
Weaviate
Chroma
FAISS

Chunking

Chunking is splitting documents.

1. Why Chunking Matters

Bad chunking = bad retrieval.

Example problem:

Chunk 1:
RAG systems use semantic

Chunk 2:
search through vectors

Meaning gets broken.

2. Character-Based Chunking

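def chunk_text(text,
               chunk_size=800,
               overlap=150):

    # The window advances by chunk_size - overlap each time, so the overlap
    # must be smaller than the chunk size or the loop would never move forward.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):

        end = start + chunk_size

        # Slicing past the end of the string is safe: the last chunk is shorter.
        chunk = text[start:end]
        chunks.append(chunk)

        start += chunk_size - overlap

    return chunks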

3. Overlap

Preserves context.

Example:

Chunk 1 → 0-800
Chunk 2 → 650-1450

Overlap:

150 characters
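To see where the 0-800 and 650-1450 ranges come from, here is a small sketch that computes only the character offsets, using the same chunk size and overlap as above. The helper name `chunk_spans` is hypothetical.

```python
def chunk_spans(text_length, chunk_size=800, overlap=150):
    """Return the (start, end) character offsets of each chunk."""
    step = chunk_size - overlap  # how far the window advances each time
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")

    spans = []
    start = 0
    while start < text_length:
        # The last chunk is clamped to the end of the text.
        spans.append((start, min(start + chunk_size, text_length)))
        start += step
    return spans


print(chunk_spans(2000))
# [(0, 800), (650, 1450), (1300, 2000), (1950, 2000)]
```

Each chunk starts 650 characters after the previous one (800 minus 150). Note that the final span is only 50 characters long and is already covered by the chunk before it, which is a side effect of this simple scheme.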

Similarity Search

Pinecone compares vectors.

Usually using:

Cosine Similarity

Measures angle similarity.

Similar meaning:

High cosine score
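Cosine similarity is the dot product of two vectors divided by the product of their lengths. The database computes it for you, but a plain Python sketch shows what the score means. It assumes both vectors have the same length and neither is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```

The first pair scores 1.0 even though the vectors have different lengths, because only the direction matters.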

Retrieval Pipeline

Example retrieval:

query_embedding = generate_embedding(query)

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)

Explanation:

vector=query_embedding

Search using question vector.

top_k=5

Retrieve top 5 results.

include_metadata=True

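Return the metadata stored with each match. The original chunk text comes back only because it was saved in the metadata at upsert time.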

Prompt Augmentation

This is the "augmentation" in RAG.

We inject context.

Example:

context = "\n\n".join(
    match["metadata"]["text"]
    for match in results["matches"]
)

Prompt Example

prompt = f"""
You are a helpful assistant.

Answer ONLY using the context.

Context:
{context}

Question:
{query}

Answer:
"""

Generation Phase

Send prompt to the LLM.
I used Mistral, running locally through Ollama:

response = ollama.chat(
    model="mistral",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)

print(response["message"]["content"])

Pinecone Concepts

Below are some Pinecone concepts I used and hope you might find helpful.

1. Index

Container of vectors.

Equivalent to:

Database table

2. Creating Index

from pinecone import Pinecone

pc = Pinecone(api_key=API_KEY)

pc.create_index(
    name="rag-demo",
    dimension=768,
    metric="cosine",
    spec={
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    }
)

3. Upsert

Insert/update vectors.

index.upsert(vectors=vectors)
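The `vectors` list is not shown above. As far as I know, Pinecone's Python client accepts a list of dicts with `id`, `values`, and `metadata` keys, so one way to build it looks like the sketch below. The `build_vectors` helper, the ID scheme, and the stand-in embedding function are assumptions for illustration; in the real pipeline you would pass `generate_embedding` instead.

```python
def build_vectors(chunks, embed, source):
    """Turn text chunks into records shaped for an upsert call."""
    vectors = []
    for i, chunk in enumerate(chunks):
        vectors.append({
            "id": f"{source}-{i}",   # hypothetical ID scheme: source name + position
            "values": embed(chunk),  # the embedding for this chunk
            "metadata": {"text": chunk, "source": source},
        })
    return vectors


# Stand-in embedding function so the sketch runs without a model.
def fake_embed(text):
    return [float(len(text)), 0.0, 0.0]


vectors = build_vectors(["first chunk", "second chunk"], fake_embed, "notes.txt")
print(vectors[0]["id"])  # notes.txt-0
```

Because the IDs are deterministic, upserting the same document again overwrites the old vectors instead of creating duplicates.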

4. Query

Search vectors.

index.query(...)

5. Delete

Delete vectors.

index.delete(delete_all=True)

Metadata in RAG

Store useful context.

Example:

metadata={
    "text": chunk,
    "source": "notes.txt",
    "section": "embeddings"
}

Useful later for:

filtering
citations
debugging
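For citations, the stored `source` can be carried into the context so the LLM can refer to it. Here is a small sketch that assumes matches have the same shape as the query results used earlier (a `metadata` dict with `text` and `source`). The `format_context` helper and the sample matches are hypothetical.

```python
def format_context(matches):
    """Join retrieved chunks, labelling each with its source for citations."""
    parts = []
    for n, match in enumerate(matches, start=1):
        meta = match["metadata"]
        parts.append(f"[{n}] (source: {meta['source']})\n{meta['text']}")
    return "\n\n".join(parts)


# Hand-written matches for illustration only.
matches = [
    {"score": 0.91, "metadata": {"text": "Embeddings are vectors", "source": "notes.txt"}},
    {"score": 0.84, "metadata": {"text": "Chunks overlap", "source": "guide.md"}},
]
print(format_context(matches))
```

The numbered labels also help with debugging, since you can see at a glance which file each retrieved chunk came from.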

Best Practices

These are some best practices to follow when building your RAG system:

Retrieval quality > model quality
Use metadata
Keep chunks meaningful
Avoid tiny chunks
Re-index after document updates
Use overlap
Start simple before frameworks
Debug retrieval separately from generation
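One way to debug retrieval on its own is to keep a few test questions with the source you expect to come back, then measure how often it appears in the top results, with no LLM involved. This is a minimal sketch; `hit_rate` and the stand-in retriever are hypothetical, and a real retriever would embed the question and query the index.

```python
def hit_rate(test_cases, retrieve, k=5):
    """Fraction of questions whose expected source appears in the top-k results."""
    hits = 0
    for question, expected_source in test_cases:
        sources = retrieve(question, k)
        if expected_source in sources:
            hits += 1
    return hits / len(test_cases)


# Stand-in retriever with canned results, so the sketch runs without a database.
def fake_retrieve(question, k):
    canned = {
        "What are embeddings?": ["notes.txt", "guide.md"],
        "What is overlap?": ["faq.md"],
    }
    return canned.get(question, [])[:k]


cases = [("What are embeddings?", "notes.txt"), ("What is overlap?", "guide.md")]
print(hit_rate(cases, fake_retrieve))  # 0.5
```

If this number is low, changing the prompt or the model will not help much; the fix is in chunking, embeddings, or `top_k`.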

That said, real production RAG systems often add features that my simple personal RAG system does not have, such as:

authentication
streaming
caching
citations
reranking
hybrid search
observability
evaluation pipelines
vector versioning
document syncing

Glossary

| Term | Meaning |
| --- | --- |
| RAG | Retrieval-Augmented Generation |
| Embedding | Numerical representation of text |
| Vector | Ordered list of numbers |
| Dimension | Number of values in vector |
| Chunk | Small document section |
| Metadata | Extra vector information |
| Top-K | Number of retrieved results |
| Similarity Search | Finding closest vectors |
| Cosine Similarity | Vector closeness metric |
| Index | Pinecone vector collection |
| Upsert | Insert/update vector |
| Retrieval | Finding relevant knowledge |
| Generation | Producing final answer |
| Hallucination | Fabricated answer |
| Reranking | Reordering retrieved chunks |
| Hybrid Search | Semantic + keyword retrieval |

Conclusion

Dear reader, I hope my point of view on RAG helped you, even a little, to understand how these systems work under the hood, from embedding to retrieval to generating the final response.
That is the essence of a RAG system.

Agents

Boussaden Taha — Sat, 09 May 2026 09:17:15 +0000

Introduction

This article is my point of view on agents, with a technical deep dive into them. I'll share how I built a working AI agent from scratch, decomposing every component and discussing the trade-offs in latency, cost, and reliability along the way.
My goal is to make explicit the deterministic system that wraps a probabilistic core.

Defining an AI Agent

Before building anything, we need to draw hard boundaries between three commonly conflated systems: scripts, chatbots, and agents.

Assumptions

You should be comfortable with JavaScript (async/await, APIs), basic HTTP concepts, and JSON data structures.

Scripts (Deterministic Program)

A script is just a fixed mapping:

    y=f(x)

Same input → same output
No adaptation
No internal state beyond the execution context

For example:

    function classify(input){
      if(input.includes("error")) return "bug";
      return "general";
    }

A script has no notion of iteration, decision-making under uncertainty, or external tool usage.

Chatbots (Single-Step LLM System)

A chatbot introduces probabilistic behavior:

    y ∼ p(y ∣ x)

The output is sampled from a probability distribution, but it is still a single step: there is no iterative reasoning loop and no explicit action execution.

for example:

    const response = await llm("Explain recursion simply");

Even with conversation history, this remains a mapping, not a system, with no persistent goal tracking and no structured interaction with the environment.

Agent (Iterative, Stateful System)

An agent is fundamentally different:

    a_t ∼ π(a ∣ s_t),   s_{t+1} = f(s_t, a_t, r_t)

| Mathematical Term | Meaning | Code Representation |
| --- | --- | --- |
| s_t | Current state | `state` object |
| a_t | Chosen action | `action` JSON |
| r_t | Tool execution result | `result` |
| π | Policy (decision model) | `llm()` function |
| f | State transition | `updateState()` |

    async function step(state, memory){
      const action = await policy(state, memory);            // a_t
      const result = await execute(action);                  // r_t
      const nextState = updateState(state, action, result);  // s_{t+1} = f(s_t, a_t, r_t)

      return { nextState, action, result };
    }

An agent is iterative (multi-step execution), stateful (it maintains memory across steps), and action-oriented (it interacts with tools and the environment).

for example:

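    let done = false;
    while (!done) {
        const action = decide(state);
        const result = act(action);
        state = update(state, result);
        done = isFinished(state); // hypothetical check; without it the loop never exits
    }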

This loop is the defining feature. Without it, you just have a wrapper around an API.

Overview

I have only just begun learning about AI agents myself, and this is my very first small one, tinybot:

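    import { createGroq } from '@ai-sdk/groq';
    import { generateText } from 'ai';

    // The API key is a provider setting, so it goes to createGroq rather than
    // to the model call; here it is read from the environment.
    const groq = createGroq({ apiKey: process.env.GROQ_API_KEY });
    const model = groq('llama-3.3-70b-versatile');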

    const { text } = await generateText({
      model: model,
      system: 'Answer everything in exactly 3 words.',
      prompt: 'What is the meaning of life?',
    });

    console.log(text);

tinybot response:

    taha@192 tinybot % node tinybot.js
    Find True Happiness

Why Most AI Agent Tutorials Fall Short

In my view, most tutorials treat AI agents as black boxes. They over-abstract by relying on frameworks that hide the core mechanics, like this:

    const agent = new Agent({...});
    await agent.run();

As a result, many developers cannot debug failures, extend functionality, or reason about performance.

A More Precise View

At its core, an AI agent can be modeled as a discrete-time control system.

At each time step t, the agent:

Observes a state s_t
Chooses an action a_t
Receives a result r_t
Transitions to a new state s_{t+1}

We can express this formally:

    s_{t+1} = f(s_t, a_t, r_t)

Where:

s_t = current state (input + memory)
a_t = action chosen by the agent
r_t = result of executing the action
f = state transition function

State Representation (s_t)

State is the most underexplained part of agent systems.
Formally, it is everything the agent conditions on:

    s_t = (x, m_t, h_t)

Where:

x = current input
m_t = memory (retrieved knowledge)
h_t = interaction history

example:

    const state = {
      input: "Find a good fishing rod under $1000",
      memory: [...retrievedDocs],
      history: [...previousSteps]
    };

Key insight:

The LLM never “sees” your system, only the serialized state you provide. Bad state design means bad decisions.

Deterministic System, Probabilistic Core

An important distinction is that an agent system (loop, tools, memory) is deterministic and the policy (LLM) is probabilistic.
We can think of the full system as:

    Deterministic Runtime + Probabilistic Policy = AI Agent

Or more formally:

    Agent=Runtime(π,T,M)

Where:

π = policy (LLM)
T = set of tools
M = memory system

Why This Matters

This framing is not just academic; it directly affects how you build systems. If you don’t control s_t, the agent behaves unpredictably. If you don’t constrain a_t, the agent may hallucinate actions. And if f is poorly designed, the system becomes unstable.

Stateless vs Stateful Systems

Stateless

Stateless means each decision is independent:

    a_t ∼ π(a ∣ x)

No memory
No accumulation of knowledge
Limited reasoning depth

Stateful

Decisions depend on history:

    a_t ∼ π(a ∣ s_t)

Enables multi-step reasoning
Allows correction and refinement
Introduces complexity (memory growth, noise)

Code Comparison

Stateless:

    await llm("Summarize this article");

Stateful:

    await llm(buildPrompt({
      input,
      history,
      retrievedMemory
    }));

From Theory to Execution: Full Step Trace

Let’s walk one iteration concretely:

Step 1: Initial state

    state = {
        input: "Find a good fishing rod under $1000",
        history: [],
        memory: []
    };

Step 2: Policy decision

    {
        "action": "search_products",
        "args": { "query": "fishing rod under 1000" }
    }

Step 3: Tool execution

    result = [
      { name: "Rod 1", price: 800 },
      { name: "Rod 2", price: 650 }
    ];

Step 4: State transition

The action from Step 2 and its result from Step 3 are recorded in the history, which closes the iteration:

    state = {
        ...state,
        history: [
          {
            action: "search_products",
            result
          }
        ]
    };

Key Takeaways

An agent is defined by its loop, not its model
State design directly determines decision quality
The LLM is just a policy function, not the system itself
Determinism is a configuration choice, not a default

Core Architecture

After defining what an agent is, we need to look at the structure of the system. The question we need to answer is:
how do we decompose an agent into components that are modular, testable, and possibly scalable?
The answer is that an agent can be represented as a composition of interacting modules:

    Agent=(π,M,T,E)

Where:

π = policy (LLM decision function)
M = memory system
T = toolset
E = execution runtime (loop + orchestration)

Conceptual Architecture

    User Input
        ↓
    State Builder (input + memory + history)
        ↓
    Policy (LLM)
        ↓
    Action (JSON)
        ↓
    Tool Executor
        ↓
    Result
        ↓
    Memory Update
        ↓
    Loop (repeat or terminate)

As I see it, the conceptual architecture above is more of a feedback system than a pipeline.

Data Flow

State → Policy

Serialize state into a prompt

Policy → Action

LLM outputs structured decision

Action → Tool

System executes external function

Tool → Result

Returns data to agent

Result → State Update

Incorporated into next iteration

Concrete representation:

    async function agentStep(state, memory){
      const prompt = buildPrompt(state, memory);

      const action = await llm(prompt);       // π(s_t)
      const parsed = parseAction(action);     // structured a_t

      const result = await execute(parsed);   // T(a_t)

      const nextState = updateState(state, parsed, result); // f(s_t, a_t, r_t)

      return { nextState, parsed, result };
    }

Serialization Boundary

A serialization boundary is the checkpoint where in-memory state must take a formal format, such as JSON, YAML, or TOON. For an agent's state to "pack its bags" and travel across a network or wait in storage, it has to be serialized first.

In the end, the key point to remember is that the LLM cannot operate on objects; it operates on text.

So we define a serialization function:

    function buildPrompt(state) {
      return `
        You are an agent.

        User goal:
        ${state.input}

        History:
        ${JSON.stringify(state.history)}

        Available tools:
        ${JSON.stringify(toolSchemas)}
      `;
    }

Final verdict: the serialization function is the encoding half of the process. The decoding half happens inside the LLM's "brain" when it reads your prompt to understand the context.

Memory Systems

Without memory, an agent reduces to a stateless function. Memory turns it from a reactive loop into a system capable of contextual reasoning and personalization.

Short Term Memory

Short term memory is what you pass directly into the model.

Implementation

    const history = [
      {
        action: { name: "search_products", args: { query: "fishing rod" } },
        result: [{ name: "Rod 1", price: 850 }]
      }
    ];

Injecting into prompt

    function buildPrompt(state) {
      return `
        User goal:
        ${state.input}

        History:
        ${JSON.stringify(state.history, null, 2)}
      `;
    }

Long Term Memory

Short-term memory is not enough for an agent to remember large documents, user preferences, and cross-session knowledge. This is where persistent memory comes in.

Storage Options

Database (PostgreSQL, MongoDB)
Vector database (for semantic search)
File based storage (simple file systems)

for example:

    await db.insert({
      userId: "123",
      text: "User prefers Scorpion fishing rods",
      createdAt: Date.now()
    });

Tooling and Action Execution

Memory allows an agent to think with context, but tools allow it to act on the world.
With tools, an agent becomes an interactive system capable of retrieving data, triggering workflows, and producing side effects. Without them, it is limited to text generation.

What Makes a “Tool”

A tool is any callable function that:

Accepts structured input
Performs an operation (internal or external)
Returns a result to the agent

Examples of Tools

API calls (weather, search, payments)
File system operations
Computation utilities

for example:

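    const tools = {
      getWeather: async ({ city }) => {
        // Illustrative URL; encode the city so spaces and accents are safe.
        const url = `https://api.weather.com/${encodeURIComponent(city)}`;
        // Abort after 5 seconds so a slow API cannot stall the agent loop.
        const res = await fetch(url, { signal: AbortSignal.timeout(5000) });
        if (!res.ok) throw new Error(`Weather request failed: ${res.status}`);
        return res.json();
      }
    };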

Bear in mind that timeouts matter a lot for tools: without a time limit, a call can hang indefinitely.
One slow tool can block the entire agent loop.

Core Runtime

The center of the entire system is the agent loop. Everything we’ve built so far, from policy to memory to tools, only becomes meaningful when orchestrated through a controlled execution loop.

Minimal loop

    async function runAgent(input) {
      let state = {
        input,
        history: [],
        memory: []
      };

      for (let step = 0; step < 10; step++) {
        const action = await policy(state);
        const result = await execute(action);

        state = updateState(state, action, result);

        if (isDone(state, action)) break;
      }

      return state.output;
    }

Termination Conditions

Without termination logic, the loop is unbounded.

Practical Conditions

1. Explicit Final Action

if (action.type === "final"){
  return action.output;
}

2. Max Step Limit

if (step >= MAX_STEPS){
  throw new Error("Max steps exceeded");
}

3. Heuristic Completion

function isDone(state){
  return state.history.length > 0 &&
         state.history[state.history.length - 1].action.type === "final";
}

Why This Matters

Without termination, we will have:

Infinite loops
Unbounded cost
API rate issues

Conclusion

This article walked through what an AI agent actually looks like under the hood, from the control loop to memory and tools, with a minimal JavaScript implementation. Keep in mind that this is not a deep or complete system, just a minimal, educational implementation of what I learned while exploring AI agents, and there’s still a lot missing.
If you’re trying to learn this too, my advice is not to start with frameworks. Try to build a small agent yourself; even a basic version will force you to understand a lot, and that’s where the real learning happens.