DEV Community: DtoTHEmoon

RAG vs Agent: The Decision That Broke My System (And How I Now Enforce It Upfront)

DtoTHEmoon — Mon, 01 Jun 2026 02:14:44 +0000

Most people treat the RAG-vs-Agent question as a technical preference. Pick whichever feels right, adjust later.

I did that. It cost me two full rebuilds.

Here's the decision framework I've landed on — and the tool I built to enforce it before the first line of code gets written.

The Mistake: Treating Architecture as Reversible

I was building GrowthOS, a four-module internal talent development platform. When I hit module three — personalized learning path generation — I reached for RAG out of habit. I'd just built a solid RAG knowledge base in module one. The pattern was familiar.

Six days in, I had a retrieval system that could surface relevant learning materials. What it couldn't do:

Read an employee's current skill profile
Analyze which specific gaps needed closing
Decide the optimal sequencing given available time
Monitor whether the employee's behavior changed after completing a path
Trigger re-planning when skills shifted

RAG returned documents. The task required decisions across time. I had picked the wrong primitive, and the cost was a rebuild.

The deeper problem: I had no forcing function that made me answer the architecture question before building.

The Decision Framework

After rebuilding twice, I reduced the RAG-vs-Agent decision to three diagnostic questions:

Question 1: Is this a retrieval task or an execution task?

RAG is fundamentally a retrieval primitive: given a query, find and synthesize relevant content. It's excellent when the output is information.

Agent is an execution primitive: given a goal, take a sequence of actions using tools. It's necessary when the output is a decision or a state change.

The confusion happens because modern RAG pipelines can feel agentic — they chunk, embed, retrieve, rerank, generate. But all of that complexity is still in service of answering a question, not executing a workflow.

Question 2: Does the task require maintaining state across multiple steps?

If yes, you need Agent.

RAG is stateless by design. Each query is independent. You can build workarounds — storing context, chaining queries — but you're fighting the architecture.

Agent is stateful by design. It maintains context, tracks intermediate results, and can loop back based on what it finds.

For GrowthOS module three, the path generation workflow looked like this:
read_profile(employee_id)
→ analyze_skill_gap(profile, target_role)
→ search_materials(gap_list)
→ generate_path(gaps, materials, available_time)
→ monitor_progress(employee_id, path) ← runs continuously
→ trigger_replan(if behavior_signal_detected)

Each arrow is a tool call that depends on the result of the previous one. This is Agent territory, not RAG.
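The first four steps of that chain can be sketched in Python as one function that threads state from each tool call into the next. This is a minimal sketch, not the GrowthOS implementation: the tool names come from the workflow above, but the tool bodies, the `PathState` class, and all data are hypothetical stubs.

```python
from dataclasses import dataclass, field


@dataclass
class PathState:
    """State carried across steps; a single stateless query has no equivalent."""
    employee_id: str
    profile: dict = field(default_factory=dict)
    gaps: list = field(default_factory=list)
    materials: dict = field(default_factory=dict)
    path: list = field(default_factory=list)


# Hypothetical stub tools: real versions would hit a database or call an LLM.
def read_profile(employee_id):
    return {"id": employee_id, "skills": {"sql": 0.8, "python": 0.3}}


def analyze_skill_gap(profile, target_role):
    required = {"data_analyst": {"sql": 0.7, "python": 0.7, "statistics": 0.6}}
    needs = required[target_role]
    # A gap is any required skill where the current level is below the target.
    return sorted(s for s, lvl in needs.items() if profile["skills"].get(s, 0.0) < lvl)


def search_materials(gap_list):
    catalog = {"python": ("Python basics", 6), "statistics": ("Intro to statistics", 8)}
    return {gap: catalog[gap] for gap in gap_list if gap in catalog}


def generate_path(gaps, materials, available_time):
    path, used = [], 0
    for gap in gaps:
        if gap not in materials:
            continue
        title, hours = materials[gap]
        if used + hours <= available_time:  # only add what fits the time budget
            path.append(title)
            used += hours
    return path


def run_path_agent(employee_id, target_role, available_time):
    # Each step consumes the result of the previous one, stored on the state.
    state = PathState(employee_id)
    state.profile = read_profile(employee_id)
    state.gaps = analyze_skill_gap(state.profile, target_role)
    state.materials = search_materials(state.gaps)
    state.path = generate_path(state.gaps, state.materials, available_time)
    return state


state = run_path_agent("emp-42", "data_analyst", available_time=10)
print(state.gaps)
print(state.path)
```

The sketch leaves out monitor_progress and trigger_replan, since those run over time rather than in a single pass. The state object is the part a retrieval pipeline would have to bolt on.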

Question 3: What is the cost of getting this wrong?

RAG failure modes are usually visible and recoverable: the answer is wrong or incomplete, the user notices, you fix the retrieval. Time cost, not catastrophic.

Agent failure modes can be silent and compounding: the agent takes the wrong action, downstream steps build on that error, you find out six steps later. Or you don't find out until a user hits it in production.

This asymmetry should directly affect how much upfront rigor you apply to the architecture decision. The higher the cost of failure, the more you need to be certain before you build.
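The three questions can be written down as two small functions. This is only a sketch of the reasoning above: the function names, the string labels, and the low/medium/high rigor scale are assumptions for illustration, not part of any tool.

```python
OUTPUT_TYPES = {"information", "decision", "state_change"}


def choose_primitive(output_type, needs_state):
    """Questions 1 and 2: map output type and statefulness to a primitive."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unknown output type: {output_type!r}")
    # Anything that decides, changes state, or spans steps is Agent territory.
    if needs_state or output_type != "information":
        return "Agent"
    return "RAG"


def upfront_rigor(silent_failure, compounding_failure):
    """Question 3: how much certainty to demand before building (assumed scale)."""
    if silent_failure and compounding_failure:
        return "high"
    if silent_failure or compounding_failure:
        return "medium"
    return "low"


# Answering questions about docs: information out, no state, visible failures.
print(choose_primitive("information", needs_state=False), upfront_rigor(False, False))
# Generating and replanning learning paths: decisions, state, silent drift.
print(choose_primitive("decision", needs_state=True), upfront_rigor(True, True))
```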

The GrowthOS Module Breakdown

Running all four modules through this framework makes the pattern clear:

Module	Task Type	Stateful?	Failure Cost	Decision
Module 1: Knowledge base	Answer questions about docs	No	Low (visible)	RAG
Module 2: Skill profiling	Compute tags from behavior events	No (batch job)	Medium	Rules engine
Module 3: Learning paths	Generate + monitor + replan	Yes	High (silent drift)	Agent
Module 4: Tracking + flywheel	Detect signals, update weights	Partial	Medium	Hybrid

The interesting case is module two. You might expect a skill-tagging system to use RAG or Agent, but the task is actually deterministic: behavior events map to skill weights via defined rules, decay runs on a schedule, nothing requires LLM inference. A rules engine with a cron job is more reliable and cheaper than an LLM call for every event.
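To make "deterministic" concrete, here is a minimal sketch of what such a rules engine could look like. Everything specific in it is an assumption for illustration: the event names, the weight increments, the 90-day half-life, and the choice of exponential decay. The post only says that events map to weights via rules and that decay runs on a schedule.

```python
# Hypothetical rule table: behavior event -> (skill, weight added).
RULES = {
    "completed_course": ("python", 0.30),
    "merged_pull_request": ("python", 0.10),
    "passed_sql_quiz": ("sql", 0.20),
}

HALF_LIFE_DAYS = 90.0  # assumed: an unused skill weight halves every 90 days


def apply_event(weights, event):
    """Apply one behavior event. Unknown events leave the weights unchanged."""
    rule = RULES.get(event)
    if rule is None:
        return weights
    skill, delta = rule
    updated = dict(weights)
    updated[skill] = min(1.0, updated.get(skill, 0.0) + delta)  # cap at 1.0
    return updated


def decay(weights, days):
    """The scheduled job: exponential decay using the half-life above."""
    factor = 0.5 ** (days / HALF_LIFE_DAYS)
    return {skill: w * factor for skill, w in weights.items()}


weights = {}
for event in ["completed_course", "merged_pull_request", "opened_dashboard"]:
    weights = apply_event(weights, event)

decayed = decay(weights, days=90)
print(weights, decayed)
```

The same input always produces the same output, with no inference call per event, which is the property the task needs.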

Over-reaching for AI where deterministic logic is sufficient is one of the most common and expensive mistakes in production systems. The question isn't "can AI do this?" but "does this task actually require AI?"

The Enforcement Problem

Knowing the framework doesn't help if you don't apply it at the right moment. The right moment is before you write any code — at the point where the architecture is still a decision, not a sunk cost.

In practice, most developers (myself included) reach the architecture question after they've already started building. The pattern looks like:

Start implementing a feature
Realize something isn't working
Debug for hours
Eventually diagnose a fundamental architecture mismatch
Rebuild

What I needed was something that forced the decision earlier — ideally the moment I started describing a new module or feature, before the first tool call.

This is the problem Rein is designed to solve.

How Rein Enforces Upfront Architecture Decisions

Rein is an open-source Skill for Claude Code that monitors your development conversations and intervenes at specific diagnostic moments.

For architecture decisions, Rein's Q1 layer (SPEC) enforces a constraint: before any implementation work begins on a feature involving data retrieval or automated decision-making, the SPEC must answer:

What is the output type? (information vs decision vs state change)
Does the task require state across multiple steps?
What is the failure mode and its cost?
Which primitive does this map to: rules engine / RAG / single Agent / multi-Agent?

If you start describing an implementation without these questions answered, Rein surfaces them. Not as a checklist — as targeted questions based on what you've described.
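One way to picture the constraint is as a SPEC record with four required answers. The sketch below is not how Rein works internally (the post does not show that); the `ArchitectureSpec` class, its field names, and the `unanswered` helper are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ArchitectureSpec:
    output_type: Optional[str] = None   # information / decision / state change
    needs_state: Optional[bool] = None  # state across multiple steps?
    failure_mode: Optional[str] = None  # what goes wrong, and what it costs
    primitive: Optional[str] = None     # rules engine / RAG / single Agent / multi-Agent


QUESTIONS = {
    "output_type": "What is the output type?",
    "needs_state": "Does the task require state across multiple steps?",
    "failure_mode": "What is the failure mode and its cost?",
    "primitive": "Which primitive does this map to?",
}


def unanswered(spec):
    """Return the questions the SPEC has not answered yet, in order."""
    # `is None` rather than falsiness, so needs_state=False counts as an answer.
    return [QUESTIONS[f.name] for f in fields(spec) if getattr(spec, f.name) is None]


draft = ArchitectureSpec(output_type="decision", needs_state=True)
for question in unanswered(draft):
    print(question)
```

Only the open questions come back, which matches the idea of targeted questions rather than a full checklist.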

The second enforcement point is Q4 (verification scripts). Architecture decisions aren't just written down; they're verified. Before module three was considered "done," verify.sh included:

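# check is assumed to be a helper defined earlier in verify.sh: a label, then the command to run.
# Gate: all five tools named in the SPEC are defined. grep -c counts matching lines, so the count must be exactly 5.
check "PathAgent tool list matches SPEC" \
  "grep -c 'def get_employee_profile\|def analyze_skill_gap\|def search_learning_materials\|def generate_learning_path\|def monitor_progress' agent/path_agent.py | grep -q '^5$'"

# Gate: backend/main.py mentions the monitor agent or a scheduler (any one of the three strings passes).
check "MonitorAgent runs on schedule" \
  "grep -q 'monitor_agent\|schedule\|cron' backend/main.py"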

If the implementation drifts from the SPEC, the gate fails. You find out immediately, not in production.
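A line-counting grep also matches a commented-out `def` or a duplicated definition. As an alternative sketch (not the check used in the project), the same gate can parse the file with Python's standard `ast` module and compare the defined function names against the SPEC. The tool names are taken from the check above; `SAMPLE` is a made-up source string standing in for agent/path_agent.py.

```python
import ast

SPEC_TOOLS = {
    "get_employee_profile",
    "analyze_skill_gap",
    "search_learning_materials",
    "generate_learning_path",
    "monitor_progress",
}


def defined_functions(source):
    """Names of all functions actually defined in the source (comments ignored)."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def spec_drift(source, spec_tools=SPEC_TOOLS):
    """Return the SPEC tools missing from the source, sorted by name."""
    return sorted(spec_tools - defined_functions(source))


# Made-up stand-in for agent/path_agent.py with one tool commented out.
SAMPLE = '''
def get_employee_profile(employee_id): ...
def analyze_skill_gap(profile, target_role): ...
def search_learning_materials(gaps): ...
def generate_learning_path(gaps, materials, hours): ...
# def monitor_progress(employee_id, path): ...
'''

missing = spec_drift(SAMPLE)
print(missing)
```

In a real gate, the source would be read from the file and a non-empty result would fail the script.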

The Silence Rule

One design principle worth noting: Rein is silent when there's nothing to flag.

This matters because most Harness tooling errs toward verbosity — warning about everything, asking for confirmation constantly, inserting itself into every decision. The overhead degrades the development experience until you start ignoring it.

Rein's trigger conditions are narrow and specific. For architecture decisions:

Trigger: you describe a new feature involving retrieval or automated decisions, without a SPEC that answers the three diagnostic questions
No trigger: you're implementing a feature with a clear SPEC already written
No trigger: you're debugging, refactoring, or working on UI

In Rein's 16-scenario benchmark, it triggered on 100% of the cases where intervention was warranted and stayed silent on 100% of the cases where it wasn't. The silence test is as important as the trigger test.
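The trigger rule and its silence test can be expressed as a predicate plus a table of cases. This is a simplified sketch: Rein is described as matching patterns in conversation, so the explicit `activity` labels, the function name, and the cases below are assumptions made for illustration, not the real benchmark scenarios.

```python
def should_trigger(activity, involves_retrieval_or_decisions, spec_answers_questions):
    """Fire only for a new retrieval/decision feature that lacks a complete SPEC."""
    return (
        activity == "describing_new_feature"
        and involves_retrieval_or_decisions
        and not spec_answers_questions
    )


CASES = [
    # (activity, involves retrieval/decisions, SPEC complete, expected trigger)
    ("describing_new_feature", True, False, True),
    ("describing_new_feature", True, True, False),
    ("implementing_feature", True, True, False),
    ("debugging", False, False, False),
    ("refactoring", True, False, False),
    ("ui_work", False, False, False),
]

# Both directions are checked: firing when warranted, silence otherwise.
results = [
    should_trigger(activity, involved, complete) == expected
    for activity, involved, complete, expected in CASES
]
print(all(results))
```

Five of the six cases assert silence, which reflects how narrow the trigger is meant to be.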

The Practical Takeaway

If you're building an AI system and haven't explicitly answered these three questions for every component, you're accumulating architecture debt that compounds:

Is the output information, or a decision/state change?
Does the task require state across multiple steps?
What's the cost if this is wrong?

The answers don't have to be permanent — architectures evolve as requirements change. But they need to exist before you build, not after you've rebuilt twice.

RAG and Agent are not interchangeable tools on a gradient. They're different primitives for different problem shapes. Getting the match right early is one of the highest-leverage decisions in AI system design.

Rein is open source: github.com/DtoTHEmoon/rein-skill

Install:

git clone https://github.com/DtoTHEmoon/rein-skill.git ~/.claude/skills/rein


Why Your AI Agent Keeps Making the Same Mistakes (It's Not the Model)

DtoTHEmoon — Wed, 27 May 2026 23:28:49 +0000

Does this sound familiar?

Your AI just fixed a bug. Two weeks later, the exact same bug is back.

You deploy something, and you have no idea if it actually worked — so you manually test it.

You've written 100 lines of rules in your config file, but the AI still ignores half of them.

Every new chat session, you re-explain the same context from scratch.

I ran into all four of these problems while building an internal AI quoting system for a healthcare company, and I built it with no technical background. After months of debugging, I realized that none of these were model problems. They were Harness problems.

What is Harness Engineering?

Harness Engineering is the discipline of building the scaffolding around your AI — the rules, constraints, verification scripts, and knowledge structures that make it produce consistent, reliable output.

Without Harness, even the best model will drift, forget, and repeat the same mistakes.

The data backs this up: research shows that 80% of Agent quality failures come from Harness gaps, not model limitations. And in one benchmark, the same 15 models all improved significantly when only the Harness changed — not the models themselves.

The problem is: most people don't know what their Harness is missing. They just know something feels broken.

The framework: two dimensions, not six steps

After studying real production failures and building my own system from scratch, I organized Harness Engineering into two dimensions.

Vertical Quality Layers (Q) — required for every project

Layer	Name	What it solves
Q1	SPEC	AI knows what to build, what not to, and how to verify
Q2	Rules + Security	Hard business limits + security red lines, equally mandatory
Q3	Skills	Repetitive workflows standardized with counter-examples
Q4	Scripts (unified gate)	Nothing is "done" until scripts pass

Horizontal Scale Layers (S) — enable only when needed

Layer	Name	When to enable
S1	Context	Sessions losing coherence after ~20 turns
S2	dev-map + Memory	Project iterating 2+ months, AI re-inventing solutions
S3	Multi-Agent	Single agent consistently failing on long task chains

The key insight: Q4 is not step four. It's the exit gate for every layer. Code changes, doc updates, multi-agent outputs — all must pass Q4 before anything counts as done.

Most people skip Q4 entirely. That's why the same bug keeps coming back.

What I built: Rein

Rein is an open-source Skill for Claude Code (and any agent supporting the SKILL.md standard) that acts as a silent Harness Engineering advisor throughout your project.

It watches your conversations for patterns — not keywords — and speaks up only when it detects a real gap. When everything's fine, it stays silent. Silence is a feature.

What it detects automatically:

Repeated failures (same bug fixed twice → missing Rule or regression test)
Context loss (re-explaining background every session → incomplete project docs)
Scale shifts (internal tool going external → time to harden your Harness)
Cost spikes (API bill climbing → identifies token waste sources)
Over-engineering (more config, slower shipping → tells you what to delete)

Test results: 97% pass rate across 16 scenarios with Rein vs 52% without.

The biggest gap was in root cause diagnosis: 92% accuracy with Rein, 24% without.

A real example from my project

My verify.sh only checked if the service started. It didn't check if the business logic was correct.

So when the AI "fixed" a pricing calculation bug, it passed my verification — service was running — but the actual calculation was still wrong. Same bug, two weeks later.

After adding a business baseline check (call a known correct quote request, compare against expected output), that class of bug disappeared entirely.

This is Q4. Not just "is the service alive?" but "is the output actually correct?"
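A business baseline check of that kind can be sketched as follows. The pricing functions, the baseline cases, and the expected values are all hypothetical; the real quoting logic is not shown in this post. The point is the shape: known inputs, hand-verified outputs, and a gate that fails on any mismatch.

```python
from decimal import Decimal


def calculate_quote(unit_price, quantity, discount_rate):
    """Hypothetical correct pricing: subtotal minus a percentage discount."""
    subtotal = Decimal(unit_price) * quantity
    total = subtotal * (Decimal("1") - Decimal(discount_rate))
    return total.quantize(Decimal("0.01"))


def buggy_quote(unit_price, quantity, discount_rate):
    """Hypothetical regression: runs fine but silently ignores the discount."""
    return (Decimal(unit_price) * quantity).quantize(Decimal("0.01"))


# Known-good requests with outputs that were verified by hand.
BASELINES = [
    {"args": ("20.00", 10, "0.15"), "expected": Decimal("170.00")},
    {"args": ("100.00", 3, "0.00"), "expected": Decimal("300.00")},
]


def baseline_failures(quote_fn, baselines=BASELINES):
    """Return (args, expected, actual) for every baseline the function gets wrong."""
    failures = []
    for case in baselines:
        actual = quote_fn(*case["args"])
        if actual != case["expected"]:
            failures.append((case["args"], case["expected"], actual))
    return failures


print(baseline_failures(calculate_quote))
print(baseline_failures(buggy_quote))
```

A liveness check would pass both functions, because both return a number without crashing. Only the baseline comparison separates them.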

Install

git clone https://github.com/DtoTHEmoon/rein-skill.git ~/.claude/skills/rein

Restart your agent. Rein activates automatically — no commands needed.

Also works with: OpenClaw, Codex CLI, Gemini CLI, Cursor, and any agent supporting SKILL.md.

The core philosophy

Start minimal. Add only when you have a real pain point. And know when to subtract — Rein will tell you when your Harness is getting in your own way.

If your scaffolding is slowing you down, it's time to cut.

GitHub: github.com/DtoTHEmoon/rein-skill