NorthernDev
STOP GUESSING: The Observability Stack I Built to Debug My Failing AI Agents

The RAG pipeline is a black box. I got tired of guessing why my bot retrieved the wrong context, so I built an engine for reliable, observable vector retrieval and semantic content verification.

RAG and LLM verification are the new bottlenecks in AI development. I built MemVault (for reliable Hybrid Vector Retrieval) and ContextDiff (for deterministic AI Output Verification). The problem is observability; here are my solutions.
We are all integrating LLMs, but we rarely talk about the biggest challenge: the silent failure modes in RAG (Retrieval-Augmented Generation).
When a bot gives a wrong answer, where did it fail?

  • Did the vector search miss the key context?
  • Did the embedding model misinterpret the user's query?
  • Did the LLM output subtly change a critical fact from the source material?

Staring at JSON logs and vector IDs is not scalable. I spent 2024 struggling with this, so I shifted my focus to building tools that inject deterministic analysis and observability back into the AI pipeline.

Tool 1: MemVault – The Observable Memory Server
I built MemVault to solve the complex retrieval integrity problem. Setting up dedicated vector databases is overkill for many projects, so I designed MemVault as a robust, open-source Node.js wrapper around the reliable stack we already use: PostgreSQL + pgvector.

1. Hybrid Search 2.0: The End of Guesswork
Most RAG pipelines use only semantic search, which is brittle. MemVault ensures reliability with a weighted 3-way hybrid score:

  • Semantic (Vector): Uses Cosine Similarity via pgvector to understand meaning (50% weight).
  • Exact Match (Keyword): Uses BM25-style keyword ranking via Postgres full-text search (tsvector) to find specific IDs or error codes that vectors miss (30% weight).
  • Recency (Time): A decay function prioritizing recent memories (20% weight).
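To make the weighting concrete, here is a minimal sketch of how a 3-way hybrid score like this can be combined. The weights come from the list above; the field names, the exponential decay curve, and the 24-hour half-life are my assumptions for illustration, not MemVault's actual internals:

```typescript
// Sketch of a MemVault-style 3-way hybrid score (weights from the post;
// field names and the decay half-life are illustrative assumptions).
interface Candidate {
  id: string;
  semantic: number;  // cosine similarity from pgvector, normalized to [0, 1]
  keyword: number;   // normalized full-text (BM25-style) rank in [0, 1]
  ageHours: number;  // hours since the memory was stored
}

const WEIGHTS = { semantic: 0.5, keyword: 0.3, recency: 0.2 };
const HALF_LIFE_HOURS = 24; // assumed: recency score halves every 24h

// Exponential decay: a memory scores 0.5 after one half-life.
function recencyScore(ageHours: number): number {
  return Math.pow(0.5, ageHours / HALF_LIFE_HOURS);
}

function hybridScore(c: Candidate): number {
  return (
    WEIGHTS.semantic * c.semantic +
    WEIGHTS.keyword * c.keyword +
    WEIGHTS.recency * recencyScore(c.ageHours)
  );
}

// Rank candidates by combined score, highest first.
function rank(candidates: Candidate[]): Candidate[] {
  return [...candidates].sort((a, b) => hybridScore(b) - hybridScore(a));
}
```

The useful property of a weighted sum like this is that it is inspectable: for any retrieved document you can log the three components separately, which is exactly what makes the dashboard described below possible.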

2. The Visualizer: Debugging in Real-Time
Debugging RAG is hard. MemVault tackles this with a dashboard that visualizes the vector search in real time: you can instantly see why a specific document was retrieved and what its weighted score was.

MemVault Live Demo: https://memvault-demo-g38n.vercel.app/

3. Setup: Choose Your Economic Reality
MemVault is designed to be developer-first, offering high performance regardless of budget:

  • Self-Host (MIT License): Run the entire stack (Postgres + Ollama for embeddings) 100% offline via Docker. Perfect for privacy and zero API bills.
  • Managed API (RapidAPI): Use our hosted service to skip maintenance and infrastructure setup (Free Tier available).

Quick Start (NPM SDK):

npm install memvault-sdk-jakops88
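For the self-host path, the two containers involved can be sketched roughly like this. The image tags, ports, and embedding model are my assumptions (they are the common defaults for pgvector and Ollama), so check the MemVault README for the actual setup:

```shell
# Postgres with the pgvector extension preinstalled (official pgvector image).
docker run -d --name memvault-db \
  -e POSTGRES_PASSWORD=memvault \
  -p 5432:5432 \
  pgvector/pgvector:pg16

# Ollama for fully local embeddings -- no API keys, no external calls.
docker run -d --name memvault-ollama \
  -p 11434:11434 \
  ollama/ollama

# Pull a small embedding model once the Ollama container is up.
docker exec memvault-ollama ollama pull nomic-embed-text
```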

Tool 2: ContextDiff – Semantic Output Validation
If MemVault ensures you retrieve the right context, ContextDiff ensures the LLM doesn't ruin it.
This tool solves the Output Integrity problem: how do you verify that AI-generated text has not subtly changed facts or tone compared to the source material?

1. Deterministic Semantic Verification
ContextDiff is a production-ready FastAPI/Next.js monorepo that performs LLM-powered comparison, providing a structured assessment:

  • Risk Scoring: An objective 0-100 risk score and a safety determination.
  • Change Detection: Flags specific change types with reasoning:
    • FACTUAL: Critical claims or certainty levels changed (e.g., "will" vs. "might").
    • TONE: Sentiment or formality shifted.
    • OMISSION/ADDITION: Information was dropped or introduced.
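The list above implies a structured result you can gate on programmatically. Here is a sketch of what consuming such an assessment could look like; the field names and the threshold logic are my assumptions for illustration, not ContextDiff's actual response schema:

```typescript
// Illustrative shape of a ContextDiff-style assessment. Field names are
// assumptions for this sketch, not the real API response schema.
type ChangeType = "FACTUAL" | "TONE" | "OMISSION" | "ADDITION";

interface DetectedChange {
  type: ChangeType;
  original: string;   // phrase in the source material
  modified: string;   // what the LLM turned it into
  reasoning: string;  // why this counts as a change
}

interface Assessment {
  riskScore: number;  // 0 (safe) to 100 (high risk)
  safe: boolean;
  changes: DetectedChange[];
}

// Example gate: block publication when the score crosses a threshold
// or any factual change was detected.
function shouldBlock(a: Assessment, threshold = 50): boolean {
  return (
    !a.safe ||
    a.riskScore >= threshold ||
    a.changes.some((c) => c.type === "FACTUAL")
  );
}
```

The point of a structured assessment is exactly this: the verdict becomes a deterministic policy decision in your pipeline instead of a human skim of two text blobs.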

2. Why Simple Diff Fails
Simple diff tools are useless for AI. ContextDiff detects that "Q1 2024" changing to "early 2024" is a semantic change in certainty (a risk), not just a string difference.
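To make that concrete, here is a toy position-by-position token diff. It dutifully reports that a single token changed, but it has no way to know that the change downgrades a committed quarter to a vague window; classifying that shift is what the semantic layer exists for:

```typescript
// Toy token-level diff: reports WHAT changed, but not what the change MEANS.
// (Positional comparison only -- an insertion would shift everything; this
// is deliberately naive to show the limits of string diffing.)
function tokenDiff(a: string, b: string): Array<[string, string]> {
  const ta = a.split(/\s+/);
  const tb = b.split(/\s+/);
  const changes: Array<[string, string]> = [];
  const len = Math.max(ta.length, tb.length);
  for (let i = 0; i < len; i++) {
    if (ta[i] !== tb[i]) changes.push([ta[i] ?? "", tb[i] ?? ""]);
  }
  return changes;
}

// "Q1" -> "early" is one tiny token edit, but it changes the certainty
// of a delivery claim -- the risk a string diff cannot see.
const changes = tokenDiff("shipping in Q1 2024", "shipping in early 2024");
```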

Use Case: High-stakes content validation (Legal, Medical, Finance) where maintaining the semantic integrity of the source is mandatory.
ContextDiff demo: https://context-diff.vercel.app/

Conclusion: Stop Debugging in the Dark
The future of reliable AI engineering hinges on observable, verifiable systems. If you're tired of treating your RAG pipeline as a black box, I encourage you to explore these tools.

  • Try the ContextDiff API for output validation.
  • Find the full ContextDiff repository on GitHub.

Which problem are you struggling with most right now: slow retrieval (RAG) or unreliable output (Validation)? Let me know in the comments.

Top comments (3)

Pomelitros

Interesting

NorthernDev

Thank you!

Nick Goldstein

Great read! Thinking through how to ensure good responses from LLMs has been a pain point for me for a while. This is a nice reframe, tbh.