I've been working on a project to audit distributed hardware infrastructure — devices
spread across multiple sites, each running firmware that needs to stay compliant with a
central policy. Pretty standard enterprise ops problem.
My first instinct was RAG. Everyone reaches for RAG. You embed your documents,
stand up a vector store, and your agent can reason over your data. I've built RAG
pipelines before and they work well, so I started there.
Three days in, I switched direction.
The moment I realized RAG wasn't the right fit
I was testing the agent against a scenario where a device had failed a firmware check at
2am. The agent reported it as compliant.
The problem wasn't the model. The problem was that the data the agent was reasoning
over was from an embedded snapshot I'd generated two days earlier. The device had
drifted since then. The vector store didn't know — it can't know. It's a snapshot by
design.
That works fine for a documentation assistant. For infrastructure audit it's a problem,
because you need to know what's happening now, not what was true when you last ran
the embedding pipeline.
What I needed wasn't retrieval — it was access
Here's the reframe that changed how I thought about this.
RAG answers the question: what documents are relevant to this query?
What I actually needed to answer was: what is the current state of device X right now?
Those are different questions. One is a search problem. The other is a database query. I
was using the wrong tool.
The inventory — firmware versions, device health, site assignments — lives in a SQLite
database. The compliance policy lives in a structured text file. Neither of these is a
document in any meaningful sense. Chunking them and embedding them into a vector
store was me forcing square data into a round hole because that's what I knew how to do.
Instead, I put the database and the policy file behind an MCP server that exposes them
as tools the agent can call:
• get_inventory() — returns live device state, current to the second
• query_policy() — reads the policy file and returns the requirements
• flag_violation() — marks a device non-compliant with structured metadata
The agent calls these the same way your application code calls an API. No embedding
pipeline. No staleness problem. No guessing at similarity scores for what is
fundamentally a structured query.
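To make the shape of those three tools concrete, here is a minimal sketch using only the Python standard library. It is not the project's code: the table and column names, the `key = value` policy format, and the function signatures are all assumptions for illustration, and the MCP server wiring that would register these functions as tools is left out.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

# Hypothetical schema: the real tables and columns are not shown in the post.
SCHEMA = """
CREATE TABLE devices (
    id INTEGER PRIMARY KEY,
    site TEXT NOT NULL,
    firmware TEXT NOT NULL,
    healthy INTEGER NOT NULL
);
CREATE TABLE violations (
    device_id INTEGER NOT NULL REFERENCES devices(id),
    rule TEXT NOT NULL,
    detail TEXT NOT NULL,
    flagged_at TEXT NOT NULL
);
"""


def get_inventory(conn: sqlite3.Connection, site: Optional[str] = None) -> list:
    """Read device state from the database at call time (no cached snapshot)."""
    sql = "SELECT id, site, firmware, healthy FROM devices"
    params: tuple = ()
    if site is not None:
        sql += " WHERE site = ?"
        params = (site,)
    rows = conn.execute(sql + " ORDER BY id", params).fetchall()
    return [
        {"id": r[0], "site": r[1], "firmware": r[2], "healthy": bool(r[3])}
        for r in rows
    ]


def query_policy(policy_text: str) -> dict:
    """Parse 'key = value' lines. The real policy file format is an assumption."""
    requirements = {}
    for line in policy_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        requirements[key.strip()] = value.strip()
    return requirements


def flag_violation(conn: sqlite3.Connection, device_id: int, rule: str, detail: str) -> dict:
    """Record a violation as a structured row and return what was written."""
    record = {
        "device_id": device_id,
        "rule": rule,
        "detail": detail,
        "flagged_at": datetime.now(timezone.utc).isoformat(),
    }
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO violations (device_id, rule, detail, flagged_at) "
            "VALUES (:device_id, :rule, :detail, :flagged_at)",
            record,
        )
    return record
```

The thing to notice is that `get_inventory` runs a query every time it is called, so if a device drifts between two calls the second call sees the new state. There is nothing to re-embed.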
The gateway nobody talks about
One thing I'd push back on in most agent tutorials — they wire the LLM directly to the
frontend and call it done.
I put a FastAPI gateway in between, and I'd do it again every time.
The practical reason: NVIDIA NIM credits aren't free. A misconfigured client or a
runaway loop can drain your quota in minutes if there's nothing between the UI and the
model. The gateway enforces rate limits per IP before a single token is generated.
Saved me actual money during development.
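For a sense of what a per-IP limit involves, here is a small sliding-window limiter in plain Python. It is a sketch under assumptions, not the gateway's actual code: the class name, the limits, and the in-memory storage are all hypothetical, and the FastAPI wiring (a middleware or dependency that rejects the request, typically with HTTP 429, when `allow` returns `False`) is omitted.

```python
import time
from collections import defaultdict, deque


class PerIPRateLimiter:
    """Allow at most max_requests per IP within a sliding window_seconds window."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # ip -> timestamps of recent accepted requests

    def allow(self, ip: str) -> bool:
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over quota: reject before any model call is made
        hits.append(now)
        return True
```

An in-memory dictionary like this only works for a single gateway process. With several workers the counters would need to live in shared storage.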
The better reason: not every query needs the full audit agent. Simple questions — how
many nodes are in Bellevue? — don't need a multi-step LangGraph agent burning
Gemini 2.5 tokens. The gateway classifies intent and routes accordingly. Simple queries
go to a lighter NIM worker. Full compliance audits go to the Gemini agent.
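The routing step can be sketched in a few lines. The post doesn't say how intent is classified, so the keyword pattern below is purely an assumption standing in for whatever classifier the gateway really uses, and `light_worker` and `audit_agent` are hypothetical callables representing the NIM worker and the Gemini agent.

```python
import re
from typing import Callable

# Hypothetical heuristic: queries that mention audit-related terms go to the full agent.
AUDIT_PATTERN = re.compile(
    r"\b(audit\w*|complian\w*|violation\w*|remediat\w*|policy|drift\w*)\b",
    re.IGNORECASE,
)


def classify_intent(query: str) -> str:
    """Return 'audit' for compliance-style questions, 'simple' for everything else."""
    return "audit" if AUDIT_PATTERN.search(query) else "simple"


def route(
    query: str,
    light_worker: Callable[[str], str],
    audit_agent: Callable[[str], str],
) -> str:
    """Send the query to the cheaper handler unless it needs the full audit agent."""
    handler = audit_agent if classify_intent(query) == "audit" else light_worker
    return handler(query)
```

A keyword match is crude and will misroute some queries. The point of the sketch is only the shape: classify first, then pay for the expensive path when the query actually needs it.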
It also centralizes auth and logging in one place, which matters when you need to show
a security team exactly what the agent did and when.
The Judge
This is the piece I'm most glad I built, and the one I almost skipped.
Every response — whether it came from the NIM worker or the Gemini agent — passes
through a secondary LLM before it reaches the user. I call it the Judge. Its only job is to
read the agent's output, check it independently against the policy file, and decide
whether the reasoning holds up.
During testing, the Judge caught something the main agent missed. The agent had
correctly identified a non-compliant firmware version, but applied a remediation rule that
belonged to a different device category. The logic was sound — it just used the wrong
rule. The Judge caught it because it reads the policy independently, without inheriting
whatever context the main agent had accumulated during its reasoning loop.
That independence is the point. If the Judge just re-reads the agent's own context, it's
not really checking anything. You want it reading from the source, fresh.
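Here is a rough sketch of that independence expressed as a function signature. Everything in it is an assumption for illustration: the `Verdict` type, the prompt wording, the PASS/FAIL reply convention, and the `ask_llm` callable that stands in for the secondary model. What matters is what the function does not accept: it gets the agent's final answer and a way to load the policy, and never the agent's accumulated context.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    approved: bool
    reason: str


def judge(
    answer: str,
    load_policy: Callable[[], str],
    ask_llm: Callable[[str], str],
) -> Verdict:
    """Check an agent answer against a freshly loaded copy of the policy."""
    policy = load_policy()  # read from the source on every call, not from agent state
    prompt = (
        "You are reviewing an infrastructure audit answer.\n"
        "Reply with PASS or FAIL on the first line, then a one-line reason.\n\n"
        f"POLICY:\n{policy}\n\nANSWER:\n{answer}\n"
    )
    reply = ask_llm(prompt).strip()
    first_line, _, rest = reply.partition("\n")
    # Fail closed: anything other than an explicit PASS is treated as a rejection.
    approved = first_line.strip().upper() == "PASS"
    return Verdict(approved=approved, reason=rest.strip() or first_line.strip())
```

Failing closed is a deliberate choice in this sketch. If the Judge's reply is malformed, the answer is held back instead of waved through.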
Humans stay in the loop
The agent can suggest remediation — here's the CLI command to fix the firmware drift
on node 7. It cannot run it.
There's a hard gate in the LangGraph state machine. Suggest remediation and execute
remediation are separate nodes, and the only path between them runs through a human
decision in the UI. An architect clicks Approve. Then and only then does the write
operation touch the database.
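The gate itself is simple enough to sketch without LangGraph. The version below is a plain Python stand-in, not the project's state machine: the class names, the states, and the example command string are hypothetical. It shows the one property that matters, which is that `execute` refuses to run unless a human decision has been recorded first.

```python
from enum import Enum
from typing import Callable, Optional


class Step(Enum):
    SUGGESTED = "suggested"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTED = "executed"


class ApprovalRequired(Exception):
    """Raised when execution is attempted without a recorded approval."""


class Remediation:
    def __init__(self, device_id: int, command: str):
        self.device_id = device_id
        self.command = command
        self.step = Step.SUGGESTED  # the agent can only ever create this state
        self.approved_by: Optional[str] = None

    def decide(self, reviewer: str, approve: bool) -> None:
        """Record the human decision. Called from the UI, never by the agent."""
        if self.step is not Step.SUGGESTED:
            raise ValueError("a decision has already been recorded")
        self.step = Step.APPROVED if approve else Step.REJECTED
        self.approved_by = reviewer if approve else None

    def execute(self, apply: Callable[[int, str], None]) -> None:
        """Run the write operation only if a human approved it."""
        if self.step is not Step.APPROVED:
            raise ApprovalRequired(f"remediation is {self.step.value}, not approved")
        apply(self.device_id, self.command)
        self.step = Step.EXECUTED
```

Because `EXECUTED` is only reachable from `APPROVED`, a rejected or already-executed remediation can't be run a second time by accident.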
For infrastructure this felt like the right call. The cost of a false positive — a remediation
that runs when it shouldn't — is much higher than the cost of an extra approval click.
What I'd do differently
Two things.
I'd instrument RAGAS metrics from day one. I ended up retrofitting evaluation on the
agent's audit outputs and found gaps I'd been manually poking at for weeks.
Faithfulness and context relevancy scores would have surfaced those faster.
And I'd write the red-team report in parallel, not after. I know what failure modes the
Judge catches now, but I reconstructed most of that knowledge from memory rather
than documenting it as I found it. A live failure log from the start would've made that
report much sharper.
The short version
RAG is the right tool for knowledge retrieval over static content. It's a less natural fit
when your agent needs to query live structured data and act on what it finds.
MCP let me give the agent real database access through a typed tool interface — no
embedding pipeline, no staleness, no similarity search on what is fundamentally a
relational query. For infrastructure audit, that was the right call.
Code is on GitHub if you want to dig into the architecture. Happy to go deeper on the
LangGraph state machine or the Judge design in the comments.

