A backend engineer's first step into AI Engineering: embeddings, vector search, and the chunking bug that made everything click.
Why I decided to pivot toward AI Engineering
I have been a backend engineer for a while now: TypeScript, NestJS, distributed systems, APIs in production. I like that work. But at some point I started paying attention to a specific career trajectory I came across: someone with a background almost identical to mine who had moved into AI Engineering. They had not abandoned backend work; they had extended it.
That reframed everything for me. This wasn't a pivot away from what I knew. It was a direction to grow into. And I decided to start from the fundamentals, not from the tooling.
So instead of installing LangChain and following a tutorial, I built a RAG pipeline from scratch, no abstractions, no magic. Just Python, the Gemini API, and ChromaDB. Here is what I learned.
What RAG actually is
Before writing a line of code, I needed a mental model that made sense to me as an engineer.
RAG stands for Retrieval-Augmented Generation. The idea is simple: LLMs have frozen knowledge (their training cutoff) and a limited context window. You cannot feed an entire codebase or document library into a single prompt. RAG solves this by fetching only the relevant fragments at query time and injecting them into the context before the LLM responds.
Think of it as hiring a brilliant consultant who knows nothing about your company. Instead of retraining them from scratch, you hand them the relevant documents before each meeting. That is RAG.
The pipeline has two phases:
INDEXING (runs once):
Document → chunking → embeddings → vector database
QUERYING (runs on every question):
Question → embedding → similarity search → top K chunks → LLM → answer
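The two phases can be sketched in a few lines. This is a minimal illustration, not the project's code: `toy_embed` and `toy_similarity` are hypothetical stand-ins (a set of words and Jaccard overlap) for a real embedding model and vector search, used only so the sketch runs without an API key.

```python
def toy_embed(text):
    # Hypothetical stand-in for an embedding model: the set of lowercase words.
    return set(text.lower().split())

def toy_similarity(a, b):
    # Jaccard overlap; a real pipeline would compare dense vectors instead.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_index(chunks):
    """INDEXING: embed every chunk once and keep (chunk, embedding) pairs."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(index, question, k=2):
    """QUERYING: embed the question and return the k most similar chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: toy_similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

index = build_index([
    "controllers handle incoming requests",
    "providers are injected as dependencies",
    "modules group related features",
])
top_chunks = retrieve(index, "what do controllers handle", k=1)
print(top_chunks)
```

The retrieved chunks would then be placed in the prompt before the LLM call. The index is built once, while `retrieve` runs on every question.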
Embeddings: meaning as coordinates
The concept that unlocked everything for me was embeddings. An embedding is a vector, nothing more than a list of numbers, that represents the semantic meaning of a piece of text. Similar meanings produce similar vectors. Dissimilar meanings produce distant vectors.
This is not keyword matching. It is geometry. When you search a vector database, you are finding the nearest neighbors in a high-dimensional space. A question about "payment processing failures" can match a chunk that talks about "error handling in transactions", even if they share no words.
The model learned these relationships from co-occurrence patterns across billions of sentences. It never "saw" what a dog looks like, but it learned that "dog" and "cat" appear in similar contexts (pet care articles, veterinary advice, adoption stories), while "car" appears in entirely different ones. That contrast is encoded in their vector coordinates: "dog" and "cat" end up geometrically close, and "car" ends up far away.
In my project, each chunk produced a vector with 3072 dimensions using gemini-embedding-001.
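To make the geometry concrete, here is a small sketch of cosine similarity, one common way to measure how close two embeddings are. The three-dimensional vectors below are made up for illustration; real embeddings from a model would have thousands of dimensions and different values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, chosen by hand so that dog and cat point in a similar direction.
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

dog_cat = cosine_similarity(dog, cat)
dog_car = cosine_similarity(dog, car)
print(f"dog vs cat: {dog_cat:.2f}")
print(f"dog vs car: {dog_car:.2f}")
```

With these hand-picked numbers, the dog/cat score comes out much higher than the dog/car score, which is the "nearest neighbor" relationship a vector database searches for.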
The architecture
rag-project/
├── src/
│   ├── chunking.py       # text splitting logic
│   ├── embeddings.py     # embedding generation via Gemini API
│   ├── vector_store.py   # ChromaDB setup
│   └── llm.py            # prompt construction and response generation
├── main.py               # orchestrates the full pipeline
└── .env                  # API keys
Each module exports only functions. No logic runs on import. main.py is the only place that decides what executes and in what order.
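A rough sketch of that "functions only, main.py decides" layout is shown here. The stub bodies are hypothetical placeholders standing in for the real modules under src/, inlined so the example runs on its own; only the shape (pure functions plus a guarded entry point) is the point.

```python
# Stubs standing in for src/chunking.py and src/embeddings.py (hypothetical bodies).
def chunk_text(text):
    # Placeholder splitter: one chunk per blank-line-separated paragraph.
    return [p for p in text.split("\n\n") if p.strip()]

def embed(chunk):
    # Placeholder vector, not a real embedding.
    return [float(len(chunk))]

def main(document):
    """Decide what executes and in what order; the modules only supply functions."""
    chunks = chunk_text(document)
    vectors = [embed(c) for c in chunks]
    return list(zip(chunks, vectors))

if __name__ == "__main__":
    # Runs only when the file is executed as a script, never on import.
    for chunk, vector in main("First paragraph.\n\nSecond paragraph."):
        print(vector, chunk)
```

Because nothing executes at import time, each function can be imported and tested in isolation.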
Chunking: the step most tutorials skip
Chunking is dividing your document into fragments before generating embeddings. The size matters more than I expected.
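def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid emitting a redundant overlap-only tail
        # Step back by `overlap` so text near a boundary appears in both chunks.
        start = end - overlap
    return chunks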
The bug that taught me the most
I asked the system (in Portuguese): "O que são controllers no NestJS?" — "What are controllers in NestJS?"
The response (in Portuguese): "Não sabe." — "Does not know".
The LLM was Gemini. Gemini absolutely knows what NestJS controllers are. I had explicitly instructed it to answer only from the provided context — so when the context was wrong, it answered honestly that it did not know.
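A context-only instruction of this kind could look roughly like the sketch below. The wording and the `build_prompt` name are assumptions for illustration; the exact prompt used in the project is not shown in this post.

```python
def build_prompt(question, context_chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What are controllers in NestJS?",
    ["Controllers handle incoming requests."],
)
print(prompt)
```

The explicit fallback reply is what turns bad retrieval into a visible "I don't know" instead of a fluent guess.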
I inspected the context being sent to the model:
Controllers no NestJS são responsáveis os controllers via injeção de dependência. ("Controllers in NestJS are responsible the controllers via dependency injection.")
The chunk had been cut in the middle of a sentence. The fix was increasing the chunk size from 200 to 400 characters. The system then answered correctly.
This is the failure mode that matters in production RAG. The pipeline does not crash. It runs perfectly and produces a wrong answer. The actual problem was upstream, in the chunking strategy.
Chunk size directly affects answer quality. Too small: the embedding captures a fragment without enough semantic content. Too large: the embedding averages over too much content and loses specificity.
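One alternative to tuning a fixed character count is to pack whole sentences into each chunk, so a cut never lands mid-sentence. The sketch below is a hypothetical variant, not the fix used in this project (which was simply a larger chunk size), and its regex-based sentence splitting is a simplification that will misfire on abbreviations.

```python
import re

def chunk_by_sentence(text, max_chars=400):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "Controllers handle requests. Providers hold business logic. Modules group features."
pieces = chunk_by_sentence(sample, max_chars=60)
for piece in pieces:
    print(repr(piece))
```

Every chunk now ends on sentence-final punctuation, at the cost of uneven chunk lengths.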
What I understand now that I did not before
RAG is simpler to implement than I expected. The hard part is not the code, it is the judgment. Knowing when a chunk is too small. Knowing when retrieved context is semantically close but factually irrelevant. Knowing when to restrict the LLM to context and when to let it reason freely.
The libraries abstract the mechanics. The engineering is in the decisions around them.
Retrieval quality determines answer quality. The LLM is the last step. If the chunks going in are wrong, no model in the world will produce a correct answer.
What comes next
This was a minimal implementation on purpose. The next version will index a real corpus, the parsed books of A Song of Ice and Fire, with structure-aware chunking by chapter, metadata filters by POV character and book, and conversation history for a proper chatbot experience.
After that: evals. Measuring whether the system actually answers correctly at scale is what separates a working demo from a production system.
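As a starting point, a retrieval eval can be as simple as checking whether an expected snippet shows up in the top K chunks. This is a hypothetical sketch: `hit_rate_at_k`, the test cases, and the `fake_retrieve` function are illustrative only, and a real eval would call the actual retriever against a labeled question set.

```python
def hit_rate_at_k(eval_cases, retrieve, k=3):
    """Fraction of questions whose expected snippet appears in the top-k chunks."""
    if not eval_cases:
        return 0.0
    hits = 0
    for question, expected_snippet in eval_cases:
        retrieved = retrieve(question, k)
        if any(expected_snippet in chunk for chunk in retrieved):
            hits += 1
    return hits / len(eval_cases)

# Hypothetical fixed retriever, used only to exercise the metric.
def fake_retrieve(question, k):
    canned = {
        "What are controllers?": ["Controllers handle incoming requests."],
        "What are providers?": ["Modules group related features."],
    }
    return canned.get(question, [])[:k]

cases = [
    ("What are controllers?", "handle incoming requests"),
    ("What are providers?", "injected as dependencies"),
]
score = hit_rate_at_k(cases, fake_retrieve, k=1)
print(f"hit rate@1: {score:.2f}")
```

Tracking a number like this across chunking changes makes it possible to tell whether a new chunk size actually helped retrieval.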
If you are a backend engineer considering a move toward AI Engineering: start here. Build it without the frameworks first. The abstractions make much more sense once you know what they are hiding.
Top comments (5)
"One wrong answer made me dive deeper" is the exact right reaction, because that single confidently-wrong answer is the whole RAG problem in miniature - retrieval pulled something irrelevant (or missed the right chunk) and the model papered over the gap with a fluent hallucination. The failure isn't usually the LLM, it's the retrieval: chunking that splits the answer across boundaries, embeddings that match on surface similarity not meaning, no re-ranking, no "I don't have this" path. Building it from scratch is the best way to feel exactly where each of those breaks.
The fix that matters most: an abstain path. A RAG system that says "the retrieved context doesn't contain this" beats one that always answers, because one confident wrong answer destroys trust faster than ten honest "I don't knows." That grounded-or-abstain discipline is core to how I build Moonshift - a multi-agent pipeline that takes a prompt to a deployed SaaS, where a verify layer gates each step instead of trusting the model's first output. Same principle as good RAG: don't trust, retrieve-and-verify. Multi-model routing keeps a build ~$3 flat, first run's free no card. Great writeup - what did the wrong answer turn out to be, bad chunking or an embedding mismatch? That root-cause usually decides whether you fix retrieval or add a re-ranker.
That was exactly my reaction as well. Seeing the system effectively say “I don’t know” was surprisingly refreshing. In my opinion, trying to answer without enough context is far more dangerous, especially for users who don’t have prior knowledge of the topic. One confident but incorrect answer can seriously damage the trust relationship between the user and the LLM.
From what I observed, the issue seemed much closer to bad chunking than an embedding mismatch. I’m experimenting with RAG at the lowest possible level right now to understand where failures actually emerge, and chunk quality has consistently felt like the biggest lever. I’ve tested different models and embedding approaches, but if the relevant context is fragmented or poorly segmented, retrieval quality drops fast.
I’m still trying to find the “sweet spot” for chunking, but I have the feeling that once I get that right, retrieval precision will improve dramatically. That experience actually made me more interested in RAG, because seeing exactly why a system fails teaches a lot more than when everything works.
Finding the "I don't know" refreshing is the right reaction, and you're attacking the problem at exactly the right layer. Chunk quality being the biggest lever matches everything I've seen, because retrieval is the ceiling on the whole system: if the relevant context is fragmented across chunk boundaries or buried with unrelated text, no embedding model or LLM downstream rescues it. You capped quality before generation ran. Testing different models and embeddings and finding that chunking dominated is the lesson most people learn last. Two things tend to mark the sweet spot when you get there: chunk on semantic boundaries (sections, logical units) rather than fixed token counts so you stop splitting mid-thought, and use a little overlap so a fact that straddles a boundary survives in at least one chunk. The other quiet win is adding a reranker over the top-k; it forgives a lot of chunking imperfection by promoting the actually relevant passage. Learning by watching exactly why it fails is the fastest path, and you're doing it right. Are you chunking fixed-size right now, or already moving toward structure-aware splitting?
Great dive into the practical aspects of building a RAG pipeline! Chunking can indeed be tricky—did you face any challenges with data preprocessing?
Thanks! Honestly, with this sample size I didn't face meaningful preprocessing challenges; the text was clean enough that chunking was the real bottleneck.
But that's exactly why in the next version I'll index the full A Song of Ice and Fire corpus (~50k paragraphs extracted from EPUBs). I expect preprocessing to become a real concern there, especially around chapter boundaries and POV detection. Will document everything when I get there.