Tariq Davis

Posted on May 31 • Edited on Jun 1

I Built a Tool That What Actually Happens When You Let an AI Agent Hunt Your Own Blind Spots

#hermesagentchallenge #devchallenge #agents #ai

Hermes Agent Challenge Submission: Write About Hermes Agent

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

The Tool That Investigated Me While I Was Building It

I didn't plan ECHO Hunt.

It came out of a simple question I kept sitting with: when you build something with AI, did you actually learn anything — or did the AI just carry you through it and you called it progress?

That question has weight when you're building alone. No CS degree, no bootcamp, no senior developer to tell you what you missed. Just you, an AI, and a string of errors that may or may not mean anything. That's how I build. That's how most people building with AI today build. And nobody has a tool that tells you the truth about what actually happened in a session.

So I built one.

Where the idea actually came from

Vibe coding is how I build. How most people build with AI now. You describe what you want, the AI generates it, you run it, fix errors, iterate. You ship something real. But there's a problem nobody talks about: you have no idea what you actually learned vs what the AI just did for you.

The forensic framing came naturally. I study Digital Forensics. My research group spent months building an academic proposal on IoT cybersecurity challenges in the Caribbean. I know what it means to investigate something without the right tools, without the right lab, without the institutional support that makes forensic methodology accessible.

That gap between what the framework assumes and what actually exists — that's the same gap between what vibe coding produces and what the builder actually retains.

ECHO Hunt applies forensic investigation logic to a vibe coding session. Not as a metaphor. As a structure.

The cognitive TTPs

Before I built anything, I needed a framework for what "blind spots" actually means. I didn't want a vague "learning reflection" tool. I wanted something that could name what went wrong the way a forensic report names evidence.

Four cognitive TTPs — Tactics, Techniques, and Procedures, borrowed directly from MITRE ATT&CK framing:

Borrowed Confidence — you accepted AI output without verifying it
Shallow Resolution — you fixed the error without understanding why it existed
Pattern Blindness — you repeated the same error class multiple times without recognizing it
Premature Exit — you moved on before your understanding was solid

These aren't personality judgments. They're patterns of behavior that appear in session logs. Observable. Huntable.

Why Hermes Agent specifically

The build needed something that could reason across an entire session log — not just summarize it, but investigate it. Form hypotheses before analyzing. Hunt each one against the evidence. Return structured output that the game layer could use without any additional AI calls during gameplay.

That's an agentic task, not a prompt task. A single LLM call can summarize. An agent can investigate.

Hermes Agent's skills system made this real. I created a custom skill called echo-hunt — procedural memory that Hermes loads and executes as a reusable investigation procedure. One call via hermes -z and the entire hunt runs: hypotheses formed, evidence collected, TTPs mapped, attribution challenges generated with locked correct answers and plausible distractors.

Everything pre-computed before gameplay starts. Zero AI calls during the investigation. The game runs entirely on cached data.

The part I didn't expect

I built the game layer to eliminate hallucination on the Evidence Integrity score. The first version of ECHO had Hermes generating a number — and it kept changing on identical inputs. Meaningless.

The fix was architectural. Remove the generated score entirely. Calculate it from player behavior instead:

Each TTP attribution you get right or wrong
Each pre-hunt declaration that matches or misses Hermes's findings
Each confirmed finding weighted by severity

Hermes provides the evidence. You interact with it. Your decisions produce the number.

That shift — from AI-generated to player-computed — is what made ECHO Hunt honest. The score isn't Hermes's opinion of your session. It's a record of how accurately you read your own blind spots.

What Hermes found about me

I ran ECHO Hunt on the session where I built ECHO Hunt.

Shallow Resolution [MODERATE]: Configuration issues were handled by repeatedly replacing files rather than analyzing why the specific settings were failing.

Borrowed Confidence [HIGH]: Acceptance of a large-scale UI rewrite immediately following a minor skill update, assuming the logic was correct without testing.

Premature Exit [LOW]: Using a wait-time heuristic to resolve a loading screen issue instead of implementing a proper readiness check.

The confirmed finding that hit hardest:

"The sequence of 'still not working' → 'try changing format' → 'config is getting corrupted' → 'paste in a clean config' shows a lack of diagnostic precision."

That's not a generated critique. That's evidence from my own session. I spent hours trying to get the Hermes gateway API server running on port 8642. Every attempt fixed the surface. None of them diagnosed the root cause. Eventually I pivoted to hermes -z as a CLI approach — which worked immediately. Shallow Resolution, confirmed.

The tool I built to catch blind spots caught mine. That's either ironic or exactly the point. Probably both.

The cultural layer

There's no forensic lab in Jamaica built for this. No senior dev ecosystem, no bootcamp pipeline that hands you a structured environment and says "learn here." What exists is curiosity, whatever tools you can access, and the willingness to figure it out.

That's just the reality. And it shaped everything about how ECHO Hunt was built and who it's for.

The self-taught builder outside formal systems doesn't need another tutorial. They need something that tells them the truth about what happened in the session they just finished — not what the AI thinks they learned, but what the evidence shows. The gap between those two things is where real growth either happens or doesn't.

ECHO Hunt exists in that gap. It treats your thinking like it matters enough to investigate. Not because you went to the right school or work at the right company — but because you showed up, built something, and deserve to know what you actually walked away with.

That's what a tool for the Caribbean builder looks like. That's what I wanted to make real.

🎥 Watch the full demo

🔗 github.com/FlowArchitect895/echo-hunt

Top comments (2)

Harjot Singh • May 31

An agent that hunts your blind spots is a great use of the technology precisely because it inverts the usual problem - you're not asking it to be right, you're asking it to be a tireless devil's advocate, and "surface things I'm not seeing" is a task where even imperfect output is valuable (a false positive costs you 30 seconds, a caught blind spot saves you weeks). The asymmetry makes it forgiving in a way most agent tasks aren't.

The design subtlety: a blind-spot hunter is only useful if it's adversarial by default, because the base model's instinct is to agree with you (sycophancy again). You have to engineer it to push back, assume you're wrong, and look for the uncomfortable thing - the opposite of its training. That deliberate adversarial framing is the same reason I run independent verification/critic passes in Moonshift (a multi-agent pipeline shipping a prompt to a real SaaS) rather than trusting the agreeable first answer. Cool concept - how did you stop it from just validating your existing assumptions? Getting an LLM to genuinely disagree with you is harder than it sounds.

Tariq Davis • Jun 1

The ordering is what does it. Hypotheses form before the session gets analyzed, which means the agent is already committed to what it expects to find before it can be swayed by what's actually there. That forces hunting mode instead of summary mode.

The declaration layer adds another constraint on my end. I commit to my blind spots before seeing what Hermes found. So even when the agent misses something, the gap between what I declared and what it surfaced is still useful. False positives are cheap. Caught blind spots aren't.

The sycophancy problem doesn't get solved by prompting, it gets bypassed by structure.

But I am curious about the independent critic pass in Moonshift, that's exactly the direction I'd take this for longer sessions. Pretty cool