I started where a lot of us do: a LangChain RAG walkthrough. You chunk some text, embed it, retrieve top‑k chunks, and wire an LLM to answer questions. It clicks quickly, which is exactly why it’s easy to walk away thinking you’ve “done RAG.”
What bothered me was that the demo corpus is usually tiny and artificial, which is by design. I write on DEV.to about things like NLP routing and CNN image classification. If I can’t point a system at my own posts and get answers I can verify, I’m not building something close to “real world.”
So I pivoted: pull my public articles through the DEV.to API, index them in a persistent vector store, and chat over that index.
Then I hit a second wall that matters in practice: embedding APIs have quotas. When Google’s embedding tier said “not today,” I didn’t want the project to die. I switched embeddings to Ollama (nomic-embed-text) so indexing stays local and repeatable.
At that point I had a working assistant, but I still had an open question that tutorials rarely answer honestly:
Which wiring pattern actually wins when everything else is held constant?
Not “which blog post says CRAG is fancy,” but: same corpus, same embeddings, same chat model, and three different retrieval/orchestration strategies compared with traces and numbers.
So I turned the repo into a small RAG strategy lab: a tool-calling agent, an always-retrieve chain, and a corrective RAG graph. The punchline (benchmarks) lives at the end; swap in your own runs when you reproduce this.
Concepts I’m leaning on (short, practical definitions)
Retrieval-Augmented Generation (RAG) means: before the model answers, you fetch relevant passages from your sources and condition the answer on them. The failure modes are familiar (bad chunks, wrong retrieval, the model ignoring context), which is why "one demo implementation" isn't enough.
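To make that concrete before we get to the three strategies, here is the bare-bones retrieve-then-condition loop. This is a sketch, not this repo's wiring: the store directory, prompt wording, and question are mine, but the Chroma + Ollama pieces are the same ones used later.

import requests  # not needed here; see the ingest section for fetching
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings, ChatOllama

# Assumed: a Chroma store already populated by ingest, living in ./chroma_db.
store = Chroma(persist_directory="chroma_db", embedding_function=OllamaEmbeddings(model="nomic-embed-text"))
llm = ChatOllama(model="qwen2.5:7b")

question = "What conda environment name does the ticket routing article recommend?"
docs = store.similarity_search(question, k=4)  # fetch relevant passages first
context = "\n\n".join(d.page_content for d in docs)
answer = llm.invoke(f"Answer using only this context.\n\n{context}\n\nQuestion: {question}")
print(answer.content)

Everything that follows is a variation on when and how that similarity_search call happens.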
Three strategies I compare
**Tool-calling agent**: the model decides if and when to call retrieval tools. Flexible: it can retrieve multiple times, list articles, or pull full markdown for one URL when snippets aren't enough.
**Always-retrieve chain**: every user turn runs similarity search first; retrieved passages are injected into the system prompt via middleware. Predictable: one retrieval per turn (unless you change k), no tool loop.
**Corrective RAG (CRAG)**: retrieve, then grade whether the passages are sufficient; if not, rewrite the query and retrieve again (with a hard cap on retries), then generate an answer.
**Markdown-aware chunking**: technical posts have structure (headings, sections). Splitting on headers first keeps section boundaries and metadata like h1/h2/h3 useful for citations.
**Content-hash dedup**: store a hash of each article body in chunk metadata. On re-ingest, skip unchanged articles so you're not re-embedding the internet every time.
**LangSmith tracing**: every ingest step, chat turn, and benchmark example shows up as a run you can inspect. LangSmith Evaluate runs your target function over a dataset and attaches scores.
**Persistent threads (SqliteSaver)**: conversation state survives restarts. Useful when you're dogfooding the chat UI like a product, not a one-off script.
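A rough sketch of what persistent threads buy you, assuming LangGraph's SqliteSaver (the database file name and thread id are placeholders; the compiled graph can be any of the three strategies):

import sqlite3
from langgraph.checkpoint.sqlite import SqliteSaver

# One on-disk checkpointer shared by every strategy; threads survive restarts.
saver = SqliteSaver(sqlite3.connect("threads.db", check_same_thread=False))
app = graph.compile(checkpointer=saver)  # `graph` = any agent/StateGraph built below

config = {"configurable": {"thread_id": "my-thread"}}
app.invoke({"messages": [("user", "What does the CNN post cover?")]}, config)
# Even after a restart, invoking with the same thread_id resumes the conversation.
app.invoke({"messages": [("user", "And which optimizer did it use?")]}, config)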
Prerequisites
You’ll need:
- Python 3.11+ (that’s what this repo was tested with; 3.10+ should work if dependencies install cleanly).
- A DEV.to username with public posts (swap `DEVTO_USERNAME` to yours).
- Ollama if you want local embeddings and/or local chat. For this project: `ollama pull nomic-embed-text` and a chat model such as `ollama pull qwen2.5:7b`.
- A LangSmith API key if you want traces and benchmarks (`LANGSMITH_API_KEY`).
- A Google Gemini API key if you use Gemini for chat, embeddings, or the benchmark judge (`GOOGLE_API_KEY`). If everything runs on Ollama you can skip Google, but the default bench setup keeps a stronger judge on Gemini while chat stays local.
Install and .env (from the project root)
python -m venv .venv
source .venv/bin/activate
# Windows PowerShell: .\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
cp .env.example .env
Edit .env with your keys, DEVTO_USERNAME, and model provider choices. The repo’s .env.example lists every variable with comments.
Repo layout (what each file is for)
.
├── rag_agent.py          # CLI entry: python rag_agent.py <command>
├── bench.py              # Alternate entry for benchmarks
├── requirements.txt
├── .env.example
├── evals/
│   ├── qa.yaml           # Golden questions + expected substrings + source hints
│   └── results-*.md      # Written by bench (gitignored pattern in .gitignore)
└── rag_lab/
    ├── __init__.py
    ├── __main__.py       # python -m rag_lab
    ├── config.py         # Settings, tracing, model + store factories
    ├── ingest.py         # DEV.to fetch, chunk, dedup, Chroma upsert
    ├── tools.py          # Tools for the agent strategy only
    ├── cli.py            # ingest | chat | bench
    ├── bench.py          # LangSmith dataset + evaluate runner
    └── strategies/
        ├── __init__.py   # STRATEGIES registry
        ├── prompts.py
        ├── agent.py      # Tool-calling agent
        ├── chain.py      # Always-retrieve via @dynamic_prompt
        └── crag.py       # Corrective RAG StateGraph
Build walkthrough: what goes in each file
Keep in mind you are not missing steps: the full files live in this repo, https://github.com/Fortune-Ndlovu/Build-a-RAG-agent-with-LangChain. Under each heading I explain why the piece matters, show the smallest slice that carries the idea, then point you at the full file for copy-paste.
requirements.txt
Pinned dependency list: LangChain, LangGraph, Chroma, Ollama, LangSmith, etc.
Full source in repo: requirements.txt
rag_agent.py
Thin entry point so readers run python rag_agent.py from the repo root without installing a package in editable mode.
#!/usr/bin/env python3
"""Entry point: run from repo root, e.g. `python rag_agent.py chat`."""
from rag_lab.cli import main
if __name__ == "__main__":
    main()
Full source in repo: rag_agent.py
bench.py
Same pattern for python bench.py → rag_lab.bench.main.
Full source in repo: bench.py
rag_lab/config.py
Everything reads one Settings object so switching providers is an env change, not a rewrite.
Conditional Google key: you only need GOOGLE_API_KEY when chat, judge, or embeddings use Google; otherwise local-only setups stay simple.
@model_validator(mode="after")
def _validate_google_and_export_key(self) -> Settings:
    cp = self.chat_provider.lower().strip()
    jp = self.judge_provider.lower().strip()
    ep = self.embedding_provider.lower().strip()
    needs_google = cp == "google" or jp == "google" or ep == "google"
    if needs_google and self.google_api_key is None:
        raise ValueError(
            "GOOGLE_API_KEY is required when CHAT_PROVIDER, JUDGE_PROVIDER, or "
            "EMBEDDING_PROVIDER is google (or chat_model is a google_genai:* id)."
        )
    if self.google_api_key is not None:
        os.environ["GOOGLE_API_KEY"] = self.google_api_key.get_secret_value()
    return self
Ollama chat models need an explicit provider; bare names like `qwen2.5:7b` don't always infer correctly:
def _init_chat_for_provider(*, model_id: str, provider: str, ollama_base_url: str) -> Any:
    p = provider.lower().strip()
    if p == "ollama":
        return init_chat_model(
            model_id,
            model_provider="ollama",
            base_url=ollama_base_url,
        )
    # ...
LangSmith env so traces and evals land in your project:
def setup_tracing(settings: Settings) -> None:
    if not settings.langsmith_api_key:
        _log.info("LangSmith disabled (no LANGSMITH_API_KEY).")
        return
    key = settings.langsmith_api_key.get_secret_value()
    os.environ["LANGSMITH_TRACING"] = str(settings.langsmith_tracing).lower()
    os.environ["LANGSMITH_API_KEY"] = key
    os.environ["LANGSMITH_PROJECT"] = settings.langsmith_project
    # ...
Full source in repo: rag_lab/config.py
rag_lab/ingest.py
DEV.to gives you a list endpoint and a per-article endpoint. I fetch summaries, then full body_markdown per id, chunk, and upsert.
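The two endpoints are plain REST calls; a minimal fetch loop looks roughly like this (the per_page value, timeouts, and error handling are simplifications of what ingest.py actually does):

import requests

BASE = "https://dev.to/api"

def fetch_articles(username: str) -> list[dict]:
    # List endpoint: article summaries only (no body_markdown).
    summaries = requests.get(
        f"{BASE}/articles", params={"username": username, "per_page": 100}, timeout=30
    ).json()
    articles = []
    for s in summaries:
        # Per-article endpoint: includes full body_markdown for chunking.
        full = requests.get(f"{BASE}/articles/{s['id']}", timeout=30).json()
        articles.append(full)
    return articles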
Splitting on Markdown headers first, then by characters, preserves section structure in metadata for citations:
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
    strip_headers=False,
)
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=150,
    add_start_index=True,
)
Dedup: same body → same hash → skip re-embed for that article id.
def compute_changes(
    articles: list[dict[str, Any]], existing: dict[str, str]
) -> tuple[list[dict[str, Any]], int]:
    to_process: list[dict[str, Any]] = []
    skip = 0
    for art in articles:
        body = art.get("body_markdown") or ""
        h = _sha256(body)
        aid = str(art["id"])
        if existing.get(aid) == h:
            skip += 1
            continue
        to_process.append(art)
    return to_process, skip
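The _sha256 helper referenced above isn't shown in the excerpt; a minimal version is just hashlib over the body text:

import hashlib

def _sha256(text: str) -> str:
    # Stable content hash stored in chunk metadata; same body -> same hash -> skip.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()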
@traceable on ingest steps gives you a span hierarchy in LangSmith for “what did ingest do?”
Full source in repo: rag_lab/ingest.py
rag_lab/tools.py
retrieve_context uses response_format="content_and_artifact" so the model sees serialized text, but the CLI can recover structured Document objects for a sources panel.
@tool(response_format="content_and_artifact")
def retrieve_context(query: str, k: int = 4) -> tuple[str, list]:
    """Retrieve the most similar passages from the indexed DEV.to articles."""
    docs = vector_store.similarity_search(query, k=int(k))
    serialized = "\n\n".join(
        (f"Source: {doc.metadata}\nContent: {doc.page_content}")
        for doc in docs
    )
    return serialized, docs
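On the consuming side, the second element of that tuple surfaces as ToolMessage.artifact, which is what the CLI's sources panel reads. A rough sketch, assuming a compiled agent and a thread config like the ones in cli.py (the question text is arbitrary):

from langchain_core.messages import ToolMessage

result = agent.invoke({"messages": [("user", "Which post covers ticket routing?")]}, config)
for msg in result["messages"]:
    if isinstance(msg, ToolMessage) and msg.artifact:
        for doc in msg.artifact:  # the list of Documents returned alongside the serialized text
            print(doc.metadata.get("title"), doc.metadata.get("url"))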
Full source in repo: rag_lab/tools.py
rag_lab/strategies/prompts.py
Three strategies means three prompt contracts, especially for CRAG, where grading and rewriting are separate tasks with tight output formats (YES/NO, one-word rewrite).
Example excerpt (agent tool policy + citation behavior):
AGENT_SYSTEM = """You answer questions using tools backed by the author's indexed DEV.to articles.
Tools:
- retrieve_context: semantic search over passages (use for factual detail and quotes).
- list_articles: browse titles/URLs/tags without retrieval (use for "what posts exist?" questions).
- fetch_article: load full markdown for one URL when snippets are insufficient.
...
"""
Full source in repo: rag_lab/strategies/prompts.py
rag_lab/strategies/agent.py
This is the "let the model decide" baseline: tools + system prompt via LangChain's `create_agent`.
def build(model, vector_store, settings: Settings, *, checkpointer: Any = None, name: str | None = "strategy_agent"):
    tools = make_rag_tools(settings, vector_store)
    return create_agent(
        model,
        tools,
        system_prompt=AGENT_SYSTEM,
        checkpointer=checkpointer,
        name=name,
    )
Full source in repo: rag_lab/strategies/agent.py
rag_lab/strategies/chain.py
No tools here: middleware runs retrieval on every turn and stuffs `<passage>` blocks into the prompt. Great when you want "RAG always on" and minimal branching.
@dynamic_prompt
def prompt_with_context(request: ModelRequest) -> str:
    last = request.messages[-1]
    lc = getattr(last, "content", "")
    last_query = lc if isinstance(lc, str) else str(lc)
    docs = vector_store.similarity_search(last_query, k=4)
    docs_content = "\n\n".join(
        f"<passage source={doc.metadata!r}>\n{doc.page_content}\n</passage>"
        for doc in docs
    )
    return (
        CHAIN_SYSTEM
        + "\n<context>\n"
        + docs_content
        + "\n</context>"
    )
Full source in repo: rag_lab/strategies/chain.py
rag_lab/strategies/crag.py
An explicit graph: grade sufficiency, optionally rewrite the query, cap retries.
def route_after_grade(state: CRAGState) -> str:
    if state.get("sufficient"):
        return "generate"
    if state.get("attempts", 0) >= 2:
        return "generate"
    return "rewrite"

graph = StateGraph(CRAGState)
graph.add_node("prepare", prepare)
graph.add_node("retrieve", retrieve)
graph.add_node("grade", grade)
graph.add_node("rewrite", rewrite)
graph.add_node("generate", generate)
graph.add_edge(START, "prepare")
graph.add_edge("prepare", "retrieve")
graph.add_edge("retrieve", "grade")
graph.add_conditional_edges("grade", route_after_grade, {"generate": "generate", "rewrite": "rewrite"})
graph.add_edge("rewrite", "retrieve")
graph.add_edge("generate", END)
return graph.compile(checkpointer=checkpointer, name=name or "strategy_crag")
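The prepare/retrieve/grade/rewrite/generate node functions are in the full file. To give a feel for the grading step, a simplified grade node might look like this (state keys and prompt name are my assumptions; they mirror route_after_grade above, not the repo's exact code):

def grade(state: CRAGState) -> dict:
    # Ask the chat model whether the retrieved passages can answer the question.
    passages = "\n\n".join(d.page_content for d in state["documents"])
    verdict = model.invoke(
        GRADE_PROMPT.format(question=state["question"], passages=passages)
    )
    sufficient = verdict.content.strip().upper().startswith("YES")
    # Increment attempts so route_after_grade can cap the rewrite loop at 2.
    return {"sufficient": sufficient, "attempts": state.get("attempts", 0) + 1}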
Full source in repo: rag_lab/strategies/crag.py
rag_lab/strategies/__init__.py
Strategy registry for the CLI and bench:
STRATEGIES: dict = {
    "agent": agent.build,
    "chain": chain.build,
    "crag": crag.build,
}
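Both the CLI and bench resolve a strategy name through this dict; a sketch of the lookup (the helper name is mine, the build signature mirrors agent.build above):

def build_strategy(name: str, model, vector_store, settings, checkpointer=None):
    try:
        build = STRATEGIES[name]
    except KeyError:
        raise SystemExit(f"Unknown strategy {name!r}; choose from {sorted(STRATEGIES)}")
    return build(model, vector_store, settings, checkpointer=checkpointer, name=f"strategy_{name}")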
Full source in repo: rag_lab/strategies/__init__.py
rag_lab/cli.py
SqliteSaver gives you durable threads; slash commands let you switch strategies or resume a thread without restarting Python.
Sources panel from tool artifacts:
def _format_tool_sources(messages: list) -> str:
    lines: list[str] = []
    for m in messages:
        if not isinstance(m, ToolMessage):
            continue
        art = getattr(m, "artifact", None)
        # ... Document metadata → readable lines ...
Important: run bench from the shell, not inside the chat REPL; otherwise your shell command becomes just another user message (ask me how I know).
Full source in repo: rag_lab/cli.py
rag_lab/bench.py
ensure_dataset uploads evals/qa.yaml once; client.evaluate runs each strategy against the same examples.
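The evaluate call itself is thin; roughly like the sketch below. The dataset name, target helper, and collect_urls are placeholders, but the two evaluators are the real functions shown next, and the `answer`/`retrieved_urls` output keys match what they read:

from langsmith import Client

client = Client()

def target(inputs: dict) -> dict:
    # Run one strategy over one benchmark question; expose what the evaluators need.
    result = app.invoke({"messages": [("user", inputs["question"])]}, config)
    return {"answer": result["messages"][-1].content, "retrieved_urls": collect_urls(result)}

client.evaluate(
    target,
    data="devto-rag-golden",                        # name of the uploaded qa.yaml dataset
    evaluators=[retrieval_recall, llm_judge_correctness],
    experiment_prefix="agent",
)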
Retrieval score: did we surface a URL containing the expected substring?
def retrieval_recall(run: Any, example: Any) -> EvaluationResult:
    ref = (example.outputs or {}) if example else {}
    expected_url = ref.get("expected_source_url_contains") or ""
    if not expected_url:
        return EvaluationResult(key="retrieval_recall", score=1.0)
    outs = run.outputs or {}
    urls = outs.get("retrieved_urls") or []
    hit = any(expected_url in u for u in urls)
    return EvaluationResult(key="retrieval_recall", score=1.0 if hit else 0.0)
Correctness: LLM-as-judge against an expected substring (case-flexible intent):
def llm_judge_correctness(run: Any, example: Any) -> EvaluationResult:
    # ...
    verdict = judge.invoke([HumanMessage(content=prompt)])
    text = (verdict.content or "").strip().lower()
    ok = text.startswith("y")
    return EvaluationResult(key="correctness", score=1.0 if ok else 0.0)
Full source in repo: rag_lab/bench.py
evals/qa.yaml
Twelve labeled examples (factual, multi-hop, negative). Two excerpt rows:
- id: nlp-conda
  question: What conda environment name does the intelligent support ticket routing post recommend creating?
  expected_answer_substring: ticket
  expected_source_url_contains: intelligent-support-ticket-routing
  category: factual
- id: negative-out-of-scope
  question: What Kubernetes operator pattern does the author's DEV feed recommend for production clusters?
  expected_answer_substring: do not know
  expected_source_url_contains: ""
  category: negative
Full source in repo: evals/qa.yaml
rag_lab/__main__.py and rag_lab/__init__.py
Package entry and version string. Small files; linked for completeness.
Exact log output (full transcript from one session)
Below is the complete terminal output from my machine: first pulling models and ingesting the corpus, then running the same three questions under each strategy (agent, then chain, then crag) in one REPL session. I did not trim the Assistant replies or the httpx lines so you can see exactly what showed up.
Setup: Ollama pulls + ingest
(.venv) PS C:\Users\ndlov\Documents\Build-a-RAG agent-with-LangChain> ollama pull nomic-embed-text
pulling manifest
pulling 970aa74c0a90: 100% ▕███████████████████████████████████████████████████████████████████████████▏ 274 MB
pulling c71d239df917: 100% ▕███████████████████████████████████████████████████████████████████████████▏ 11 KB
pulling ce4a164fc046: 100% ▕███████████████████████████████████████████████████████████████████████████▏ 17 B
pulling 31df23ea7daa: 100% ▕███████████████████████████████████████████████████████████████████████████▏ 420 B
verifying sha256 digest
writing manifest
success
(.venv) PS C:\Users\ndlov\Documents\Build-a-RAG agent-with-LangChain> ollama pull qwen2.5:7b
pulling manifest
pulling 2bada8a74506: 100% ▕███████████████████████████████████████████████████████████████████████████▏ 4.7 GB
pulling 66b9ea09bd5b: 100% ▕███████████████████████████████████████████████████████████████████████████▏ 68 B
pulling eb4402837c78: 100% ▕███████████████████████████████████████████████████████████████████████████▏ 1.5 KB
pulling 832dd9e00a68: 100% ▕███████████████████████████████████████████████████████████████████████████▏ 11 KB
pulling 2f15b3218f05: 100% ▕███████████████████████████████████████████████████████████████████████████▏ 487 B
verifying sha256 digest
writing manifest
success
(.venv) PS C:\Users\ndlov\Documents\Build-a-RAG agent-with-LangChain> python rag_agent.py ingest
01:55:20 INFO rag_lab.config | LangSmith tracing enabled → https://smith.langchain.com (project=rag-agent-tutorial)
01:55:29 INFO rag_lab.cli | Ingest complete: {'articles_found': 10, 'articles_skipped_unchanged': 10, 'articles_updated': 0, 'chunks_written': 0}
{'articles_found': 10, 'articles_skipped_unchanged': 10, 'articles_updated': 0, 'chunks_written': 0}
First strategy: agent (tool-calling)
(.venv) PS C:\Users\ndlov\Documents\Build-a-RAG agent-with-LangChain> python rag_agent.py chat --strategy agent
01:56:06 INFO rag_lab.config | LangSmith tracing enabled → https://smith.langchain.com (project=rag-agent-tutorial)
Thread: 6711ee83-c6ec-41d7-bba3-b188475f5575 | strategy: agent
Commands: /help /sources /reset /rebuild /threads /resume <id> /strategy <name> /quit
You> What conda environment name does the ticket routing article recommend?
01:57:21 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
01:57:25 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
01:58:43 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Assistant> The article recommends using a conda environment named `ticket-nlp` for setting up the development environment. You can create this environment by running:
conda create -n ticket-nlp python=3.10 -y
conda activate ticket-nlp
You can find more details in the "Prerequisites" section of the article titled [Intelligent Support Ticket Routing with Natural Language Processing (NLP)](https://dev.to/fortune-ndlovu/intelligent-support-ticket-routing-with-natural-language-processing-nlp-57g1).
Sources:
[1] Intelligent Support Ticket Routing with Natural Language Processing (NLP) — 1. Problem Statement and Goal
https://dev.to/fortune-ndlovu/intelligent-support-ticket-routing-with-natural-language-processing-nlp-57g1
[2] Intelligent Support Ticket Routing with Natural Language Processing (NLP) — 9. Conclusion
https://dev.to/fortune-ndlovu/intelligent-support-ticket-routing-with-natural-language-processing-nlp-57g1
[3] Intelligent Support Ticket Routing with Natural Language Processing (NLP) — 3. Load and Explore Dataset
https://dev.to/fortune-ndlovu/intelligent-support-ticket-routing-with-natural-language-processing-nlp-57g1
[4] Intelligent Support Ticket Routing with Natural Language Processing (NLP) — 2. Prerequisites
https://dev.to/fortune-ndlovu/intelligent-support-ticket-routing-with-natural-language-processing-nlp-57g1
LangSmith project: https://smith.langchain.com (project=rag-agent-tutorial)
You> What class name is used for the CNN in the image classification post?
01:59:41 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
01:59:44 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:01:19 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Assistant> The class name used for the CNN in the image classification post is `CNN`. This can be seen in the following code snippet from the article:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        # Layer definitions...
You can find this information in the "Define the CNN Model" section of the article titled [Image Classification with Convolutional Neural Networks (CNNs)](https://dev.to/fortune-ndlovu/image-classification-and-convolutional-neural-networks-cnns-4bdl).
Sources:
[1] Image Classification with Convolutional Neural Networks (CNNs) — Define the CNN Model
https://dev.to/fortune-ndlovu/image-classification-and-convolutional-neural-networks-cnns-4bdl
[2] Image Classification with Convolutional Neural Networks (CNNs) — Conclusion
https://dev.to/fortune-ndlovu/image-classification-and-convolutional-neural-networks-cnns-4bdl
[3] Image Classification with Convolutional Neural Networks (CNNs) — Reshape for CNN Input
https://dev.to/fortune-ndlovu/image-classification-and-convolutional-neural-networks-cnns-4bdl
[4] Install Red Hat Developer hub (RHDH) in a fully air-gapped Minikube environment — 1, Download and Save Container Images (Locally)
https://dev.to/fortune-ndlovu/install-red-hat-developer-hub-rhdh-in-a-fully-air-gapped-minikube-environment-4ngk
LangSmith project: https://smith.langchain.com (project=rag-agent-tutorial)
You> Summarize how the MNIST example sets up the model.
02:01:56 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Assistant> The MNIST example in the article [Image Classification with Convolutional Neural Networks (CNNs)] sets up the model using a class named `CNN`. Here's a summary of how the model is defined and structured:
1. **Class Definition**: The `CNN` class inherits from `nn.Module`.
2. **Convolutional Layers**:
- `conv1`: A convolutional layer with 32 filters, each of size 3x3.
- `conv2`: Another convolutional layer with 64 filters, each of size 3x3.
3. **Pooling Layer**:
- `pool`: A max pooling layer that reduces the spatial dimensions by a factor of 2.
4. **Flattening**:
- The output from the convolutional layers is flattened into a one-dimensional vector to be fed into fully connected layers.
5. **Fully Connected Layers**:
- `fc1`: A fully connected layer with 64 neurons.
- `fc2`: Another fully connected layer with 10 neurons, corresponding to the 10 possible digit classes (0 through 9).
6. **Dropout and Non-linearity**:
- A dropout layer is used to prevent overfitting.
- ReLU activations are applied after convolutional layers for non-linear transformations.
Here's a simplified version of the model definition:
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(in_features=64 * 7 * 7, out_features=64)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(in_features=64, out_features=10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 7 * 7)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
This model processes input images by first applying convolutional layers to extract features, then using pooling to reduce dimensionality, and finally passing the flattened feature maps through fully connected layers to produce class scores for the 10 digit classes.
Sources:
(no retrieve_context tool calls in this turn — see LangSmith trace for internal retrieval.)
LangSmith project: https://smith.langchain.com (project=rag-agent-tutorial)
Second strategy: chain (always-retrieve)
You> /strategy chain
Strategy -> chain
You> What conda environment name does the ticket routing article recommend?
02:04:04 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:07:40 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Assistant> The ticket routing article recommends using a conda environment named `ticket-nlp` for setting up the development environment. You can create this environment by running:
conda create -n ticket-nlp python=3.10 -y
conda activate ticket-nlp
You can find more details in the "Prerequisites" section of the article [Intelligent Support Ticket Routing with Natural Language Processing (NLP)](https://dev.to/fortune-ndlovu/intelligent-support-ticket-routing-using-natural-language-processing-4b7e).
Sources:
(no retrieve_context tool calls in this turn — see LangSmith trace for internal retrieval.)
LangSmith project: https://smith.langchain.com (project=rag-agent-tutorial)
You> What class name is used for the CNN in the image classification post?
02:08:13 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:11:54 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Assistant> The class name used for the CNN in the image classification post is `CNN`. This can be seen in the following code snippet:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        # Layer definitions...
This definition is found in the "Define the CNN Model" section of the article [Image Classification with Convolutional Neural Networks (CNNs)](https://dev.to/fortune-ndlovu/image-classification-and-convolutional-neural-networks-cnns-4bdl).
Sources:
(no retrieve_context tool calls in this turn — see LangSmith trace for internal retrieval.)
LangSmith project: https://smith.langchain.com (project=rag-agent-tutorial)
You> Summarize how the MNIST example sets up the model.
02:12:31 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:14:59 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Assistant> The MNIST example in the article [Image Classification with Convolutional Neural Networks (CNNs)] sets up the model using a `CNN` class. Here’s a summary of the model setup:
1. **Class Definition**: The `CNN` class inherits from `nn.Module`.
2. **Convolutional Layers**:
- `conv1`: A convolutional layer with 32 filters, each of size 3x3.
- `conv2`: Another convolutional layer with 64 filters, each of size 3x3.
3. **Pooling Layer**:
- `pool`: A max pooling layer that reduces the spatial dimensions by a factor of 2.
4. **Fully Connected Layers**:
- `fc1`: A fully connected layer with 64 neurons.
- `fc2`: Another fully connected layer with 10 neurons, corresponding to the 10 possible digit classes (0 through 9).
5. **Dropout and Non-linearity**:
- A dropout layer is used to prevent overfitting.
- ReLU activations are applied after convolutional layers for non-linear transformations.
Here’s a simplified version of the model definition:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(in_features=64 * 7 * 7, out_features=64)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(in_features=64, out_features=10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 7 * 7)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
This model processes input images by first applying convolutional layers to extract features, then using pooling to reduce dimensionality, and finally passing the flattened feature maps through fully connected layers to produce class scores for the 10 digit classes.
Sources:
(no retrieve_context tool calls in this turn — see LangSmith trace for internal retrieval.)
LangSmith project: https://smith.langchain.com (project=rag-agent-tutorial)
Third strategy: crag (corrective RAG)
You> /strategy crag
Strategy -> crag
You> What conda environment name does the ticket routing article recommend?
02:17:40 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:18:50 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:18:54 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:18:58 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:20:05 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:21:12 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Assistant> The ticket routing article recommends the conda environment name `ticket-nlp`. This can be seen in the passage where it states:
conda create -n ticket-nlp python=3.10 -y
conda activate ticket-nlp
This command creates and activates a new conda environment named `ticket-nlp`.
Sources:
(no retrieve_context tool calls in this turn — see LangSmith trace for internal retrieval.)
LangSmith project: https://smith.langchain.com (project=rag-agent-tutorial)
You> What class name is used for the CNN in the image classification post?
02:21:31 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:22:44 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:23:56 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Assistant> The class name used for the CNN in the image classification post is `CNN`. This can be seen in the code snippet provided within the passage where the class definition starts with `class CNN(nn.Module):`.
Sources:
(no retrieve_context tool calls in this turn — see LangSmith trace for internal retrieval.)
LangSmith project: https://smith.langchain.com (project=rag-agent-tutorial)
You> Summarize how the MNIST example sets up the model.
02:24:14 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:25:43 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:25:46 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:25:51 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:27:13 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:28:35 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Assistant> The MNIST example sets up a CNN model with the following architecture:
1. It uses convolutional layers (`conv1` and `conv2`) to detect local features such as edges and curves in input images.
2. Max pooling layers are used to reduce spatial dimensions, making the model more efficient and helping to prevent overfitting.
3. The output from the convolutional layers is flattened.
4. Fully connected layers (`fc1` and `fc2`) process the flattened data to make predictions.
5. ReLU activations provide non-linearity.
6. A dropout layer is included to further reduce overfitting.
This model transforms input images into class scores representing digits 0 through 9, effectively classifying handwritten digit images from the MNIST dataset.
Sources:
(no retrieve_context tool calls in this turn — see LangSmith trace for internal retrieval.)
LangSmith project: https://smith.langchain.com (project=rag-agent-tutorial)
What you can already see in the logs (before any benchmark table)
The three strategies are not the same execution pattern.
- **agent**: You get multiple `POST .../api/chat` calls per question because the model decides what to do next, and a `POST .../api/embed` when it calls `retrieve_context`. The Sources panel lists real `ToolMessage` artifacts when retrieval ran, and you can see noise too (for example, one retrieved hit points at an unrelated article URL in the CNN question). That is normal top-k behavior; benchmarks quantify whether it hurts answers.
- **chain**: Each question shows a simple rhythm: one embed (similarity search) and one chat completion. There are no tool-call artifacts, so the CLI prints the "no retrieve_context tool calls" line and points you to LangSmith for internal retrieval. Compare the link in the conda answer to the agent run: the URL slug can differ while the prose still sounds confident.
- **crag**: Count the HTTP lines on the first question: the extra `/chat` calls align with grade (and sometimes rewrite) before generate, and you may see two embed rounds when retrieval runs again after a rewrite. CRAG can spend more time per question than chain.
Don’t run benchmarks from inside the chat REPL
After the CRAG transcript above, I accidentally pasted python rag_agent.py bench --strategy all as chat input (the small "what not to do" block right under this paragraph). The model answered from the posts, which is not what you want for a LangSmith eval. Exit with /quit, then run bench from the shell; full benchmark logs go in the next section once your run finishes.
What not to do (mistyped bench inside the REPL)
You> python rag_agent.py bench --strategy all
02:29:07 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:30:20 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:30:23 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:30:27 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:32:11 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:32:14 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:32:17 INFO httpx | HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
02:33:31 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
02:34:45 INFO httpx | HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
Assistant> I do not know. The provided passages do not contain any information related to the command `python rag_agent.py bench --strategy all`. There is no context or code snippet that describes what this command does.
Sources:
(no retrieve_context tool calls in this turn — see LangSmith trace for internal retrieval.)
Benchmarks (results)
The transcript above is the qualitative half (same three questions, three strategies). This section is the quantitative half: what one automated evaluation run looked like when scored against a small golden set (evals/qa.yaml). The screenshot is enough for readers to see the shape of the results; you do not need my hosted tracing UI to follow the story.
What the numbers say (tool-calling agent)
Twelve questions, same dataset as the rest of the lab:
| Metric | Value |
|---|---|
| Correctness (LLM-as-judge mean) | 0.75 |
| Retrieval (expected source URL present in retrieved URLs) | 1.00 |
| Latency (P50) | 134.36 s |
| Total tokens | 44,422 (input 41,661 / output 2,761) |
| Cost | $0.00 with local chat/embeddings and a hosted judge configured as in .env.example |
Finding: retrieval hit 1.00 but correctness only 0.75. In plain terms, the system usually pulled in the right article (URL checks passed), but the answer still failed the judge 3 times out of 12: wrong extraction, wrong paraphrase, or the model ignored what it had. So the bottleneck for this run was generation and grounding, not "find the post." Latency was heavy (roughly 58 s–388 s per row in the UI; ~134 s median) because each example combined slow local inference with multi-step agent behavior plus a judge call.
chain and crag on the same harness
I ran chain and CRAG through the same benchmark script and YAML; wall-clock and judge quotas made those runs noisier to interpret than the agent screenshot, so I am not pasting a second dashboard here. Qualitatively they match what we already saw in the REPL logs: chain is the steady “one retrieval, one reply” pattern; CRAG fires extra model steps when it grades or rewrites before answering.
What I’d actually choose
- If you want the simplest thing that behaves: **chain**. Always-retrieve keeps behavior predictable and avoids tool-loop variance; it's a good default when your corpus is stable and questions look like Q&A over passages.
- If you want citations and optional browsing: **agent**. Tools plus artifacts are worth it when users jump between "list my posts," "pull this URL," and "quote this section." The benchmark above shows retrieval can look perfect while answers still slip, so don't assume tools fix sloppy prompts or a weak chat model.
- If retrieval quality is your pain: **crag**. Grading and query rewrite cost more calls; use it when evals show bad first hits, not when the model already gets the right chunks but misreads them.
My takeaway for this project: with retrieval already at 1.00 on this slice, I would upgrade answer quality first (stronger local model, stricter “cite or abstain” prompts, maybe a reranker) before I chase a fancier graph. Chain remains my practical default for a small personal corpus; agent stays the demo-friendly mode with the sources panel; CRAG is the key I turn on when similarity search alone stops matching the question.
Twelve rows is only enough to compare wiring, not to crown a forever winner; rerun when you change models or grow qa.yaml.
Thank you for reading!
Follow me on LinkedIn.

