Shane Wilkey
Your RAG Chatbot Lies Because You're Chunking Wrong

Three weeks into building FolioChat — a chatbot that lets portfolio visitors talk to my GitHub — I had a system that gave answers like a drunk Wikipedia editor.

"Tell me about Shane's React work," someone would ask. It would respond with a rambling paragraph about my Python scripts, a random commit message, and somehow my college graduation year. The context was drifting so badly I was getting ready to scrap the whole thing.

The fix turned out to be a single architectural decision that most RAG tutorials skip entirely: chunking by meaning instead of by size.

The Problem: You're Splitting Text, Not Meaning

RAG — Retrieval Augmented Generation — sounds simple in tutorials. Chunk your documents, embed them, store them in a vector database, retrieve relevant chunks, answer questions.

Most RAG implementations break at step one: chunking. Token-based splitting, the default in every tutorial, treats your content like a deck of cards to be shuffled. It doesn't care if it splits a function definition from its docstring, or cuts a project description in half. Here's what I started with:

# Seemed reasonable. It was not.
def simple_chunk(text, chunk_size=500):
    tokens = text.split()  # whitespace "tokens" -- really just words
    for i in range(0, len(tokens), chunk_size):
        yield " ".join(tokens[i:i + chunk_size])

When someone asked "What Python projects has Shane built?", the system would retrieve:

  1. Half of a README from a React project
  2. The middle of a commit message about fixing tests
  3. A random slice of documentation that mentioned Python once

Token-based splitting fragments meaning across chunks. The model never sees a complete thought — just shards of context that fit a size budget. So it synthesizes these fragments into something grammatically correct and factually useless.
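You can see the fragmentation directly with the splitter above and a toy chunk size (real chunks are bigger, but the failure mode is identical):

```python
def simple_chunk(text, chunk_size=500):
    tokens = text.split()
    for i in range(0, len(tokens), chunk_size):
        yield " ".join(tokens[i:i + chunk_size])

text = (
    "FolioChat is a RAG pipeline built with FastAPI. "
    "It answers questions about a GitHub portfolio."
)
chunks = list(simple_chunk(text, chunk_size=6))
# The split lands mid-sentence: no chunk contains a complete
# description of the project.
print(chunks[0])  # "FolioChat is a RAG pipeline built"
```

The size budget is satisfied; the meaning is not.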

The Fix: Chunk by Meaning, Not by Size

I almost killed the project before a frustrated Claude Desktop session saved it. The suggestion was simple: chunk by meaning instead of by size.

GitHub repositories have natural semantic boundaries. A README has sections. Projects have distinct characteristics. Code has logical groupings. The data has shape — I was just ignoring it.

The real Chunker in FolioChat uses a plain dataclass:

from dataclasses import dataclass, field

@dataclass
class Chunk:
    id: str
    type: str  # identity | project_overview | project_detail | project_tech | project_story
    content: str
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        self.content = self.content.strip()

Five chunk types, each serving a specific retrieval purpose:

  • identity — one chunk per portfolio, who this developer is at the highest level
  • project_overview — elevator pitch per repo, answers "tell me about X"
  • project_tech — pure stack signal per repo, answers "do you know PostgreSQL?"
  • project_story — the why, built from README intro and commit history
  • project_detail — one chunk per README section, architecture stays with architecture

The Chunker class builds all of these from a single portfolio_data dict:

class Chunker:
    def chunk(self, portfolio_data: dict) -> list[Chunk]:
        chunks = []

        chunks.append(self._identity_chunk(portfolio_data))

        for repo in portfolio_data["repos"]:
            chunks.append(self._overview_chunk(repo, portfolio_data["username"]))
            chunks.append(self._tech_chunk(repo, portfolio_data["username"]))
            chunks.append(self._story_chunk(repo, portfolio_data["username"]))
            chunks.extend(self._detail_chunks(repo, portfolio_data["username"]))

        return [c for c in chunks if len(c.content) > 20]
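The individual builders aren't shown here. As a hedged sketch, a builder like `_tech_chunk` could look something like this (field names and the content format are assumptions, not FolioChat's actual code; the Chunk dataclass is restated so the snippet runs on its own):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    id: str
    type: str
    content: str
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        self.content = self.content.strip()

def tech_chunk(repo: dict, username: str) -> Chunk:
    # Hypothetical builder: pure stack signal, nothing narrative.
    langs = ", ".join(repo.get("languages", []))
    topics = ", ".join(repo.get("topics", []))
    return Chunk(
        id=f"{username}/{repo['name']}/tech",
        type="project_tech",
        content=f"{repo['name']} is built with {langs}. Topics: {topics}.",
        metadata={"repo": repo["name"]},
    )

chunk = tech_chunk(
    {"name": "foliochat", "languages": ["Python", "TypeScript"], "topics": ["rag"]},
    "shane",
)
```

Each builder stays small because it only has to express one kind of answer.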

When someone asks "What technologies does Shane use?", retrieval targets project_tech chunks instead of gambling on arbitrary text fragments. The question shape matches the chunk shape.

Three Embedding Backends, One Interface

Semantic chunking is the hard part. Embedding, by comparison, is a config choice. FolioChat supports three backends — local (free, no API key), OpenAI, and Voyage — all behind a common interface:

class LocalEmbedder(BaseEmbedder):
    MODEL_NAME = "all-MiniLM-L6-v2"

    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self._model = SentenceTransformer(self.MODEL_NAME)

    def embed(self, texts: list[str]) -> list[list[float]]:
        embeddings = self._model.encode(texts, convert_to_numpy=True)
        return embeddings.tolist()

    def embed_query(self, text: str) -> list[float]:
        return self.embed([text])[0]

    @property
    def dimension(self) -> int:
        return 384

The local embedder runs entirely on your machine — 384-dimension vectors, no API key, no cost. Swap backends with one config change — the rest of your code doesn't know the difference.
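The swap can be as small as a factory keyed on a config string. A sketch, with stub classes standing in for the real backends (only the local 384-dimension figure comes from the article; the rest is an assumption):

```python
# Stubs standing in for the real backend classes.
class LocalEmbedder:
    dimension = 384  # all-MiniLM-L6-v2

class OpenAIEmbedder:
    dimension = None  # depends on the chosen OpenAI model

class VoyageEmbedder:
    dimension = None  # depends on the chosen Voyage model

def make_embedder(backend: str):
    backends = {
        "local": LocalEmbedder,
        "openai": OpenAIEmbedder,
        "voyage": VoyageEmbedder,
    }
    try:
        return backends[backend]()
    except KeyError:
        raise ValueError(f"unknown embedding backend: {backend!r}")

embedder = make_embedder("local")  # one config string picks the backend
```

Everything downstream talks to the common interface and never learns which backend it got.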

The ChromaDB Gotcha Everyone Hits

ChromaDB only accepts primitive metadata values: strings, numbers, and booleans. That sounds fine until you try to store lists or datetimes:

import json
from datetime import datetime

# This breaks ChromaDB
metadata = {
    "languages": ["Python", "JavaScript"],  # lists don't serialize
    "created_at": datetime.now()            # datetimes don't serialize
}

# This works
metadata = {
    "languages": json.dumps(["Python", "JavaScript"]),
    "created_at": datetime.now().isoformat()
}

FolioChat serializes lists to JSON strings before storing and deserializes on the way out. Annoying the first time, invisible after that.
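A small helper pair along these lines keeps the serialization in one place (a sketch, not FolioChat's exact helpers):

```python
import json
from datetime import datetime

def to_chroma_metadata(metadata: dict) -> dict:
    """Flatten values ChromaDB can't store into strings."""
    out = {}
    for key, value in metadata.items():
        if isinstance(value, (list, dict)):
            out[key] = json.dumps(value)
        elif isinstance(value, datetime):
            out[key] = value.isoformat()
        else:
            out[key] = value
    return out

def from_chroma_metadata(metadata: dict) -> dict:
    """Best-effort inverse: re-inflate JSON-looking strings."""
    out = {}
    for key, value in metadata.items():
        if isinstance(value, str) and value[:1] in ("[", "{"):
            try:
                out[key] = json.loads(value)
                continue
            except json.JSONDecodeError:
                pass
        out[key] = value
    return out
```

Call the first on the way into the collection and the second on the way out, and the rest of the code never sees the restriction.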

Before and After: What Semantic Chunking Actually Fixed

The drunk Wikipedia editor became a chatbot that gives accurate, specific answers. The difference between "Shane has worked on various projects involving different technologies over time" and "Shane's FolioChat project is a RAG pipeline using FastAPI, ChromaDB, and sentence-transformers, deployed on Railway" is the RAG chunking strategy — nothing else.

The full implementation — chunker, embedder, ChromaDB layer, FastAPI server, and React widget — is here. Install it, point it at your GitHub username, and have a working portfolio chatbot in about five minutes.

Next week: the complete FolioChat architecture walkthrough, and why most developer portfolios are effectively invisible to the people looking at them.

Tags: rag, python, chromadb, embeddings, llm
Series: Building With AI Agents — Article 5 of 12
