Ray

Posted on Jun 9

My server pushes hints to agents — and the 3 iterations that led there

#ai #llm #mcp #agentskills

I avoided MCP from day one. No schema overhead, no token tax. The agent called my GraphQL API directly with a behavior spec and good documentation. I assumed that was enough: clear docs, correct architecture, let the agent figure it out.

It wasn't. The moment that changed my thinking: watching the agent burn 1,500+ tokens on a single upload because it kept guessing JSON field formats wrong, reading docs spread across multiple pages, and retrying. I fixed the docs. The problem resurfaced on different fields. I fixed those too. It kept coming back. A CLI wasn't a nice-to-have. It was the only thing that actually stopped the bleeding. And it was only the second of three iterations that led me to the most interesting idea: letting the server talk directly to the agent.

Iteration 1: Avoiding MCP from day one

The MCP context tax is well-documented at this point, so I won't belabor it. My API surface covers 34 commands. As MCP tools, that's 34 schemas × ~180 tokens = ~6,120 tokens of constant overhead in every conversation turn, regardless of whether the agent uses them.
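To make the arithmetic concrete, here is a minimal sketch using the figures above. The conversation lengths are illustrative assumptions, not measurements, and it assumes the schemas are re-sent in full on every turn:

```python
# Figures from the paragraph above; ~180 tokens per schema is an estimate.
TOOL_COUNT = 34
TOKENS_PER_SCHEMA = 180


def schema_overhead(turns: int) -> int:
    """Tokens spent on tool schemas alone, assuming they are re-sent every turn."""
    return TOOL_COUNT * TOKENS_PER_SCHEMA * turns


for turns in (1, 10, 50):  # hypothetical conversation lengths
    print(f"{turns:>2} turns: {schema_overhead(turns):,} tokens")
```

The overhead scales linearly with conversation length, which is why it hurts most in long production sessions.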

I understood this from first principles and chose a different path: a SKILL.md behavior spec + direct GraphQL API calls. No registered tools, no schema overhead. The agent reads the behavior spec once when the skill is invoked, then calls the API via curl.

Architecture: correct. Problem: solved?

No.

Lesson: Making the right architectural choice doesn't mean the agent behaves well. The real work starts after the architecture is in place.

Iteration 2: From raw GraphQL to CLI

The agent called my GraphQL API directly. Complex fields (provenance metadata as nested JSON, blocker objects, model config) required it to assemble raw JSON payloads in curl commands.

It guessed wrong constantly. One wrong field type → GraphQL error → agent reads the API docs to figure out the correct format → docs are detailed and split across multiple pages → 2+ page fetches per retry attempt. A single operation that should cost ~200 tokens was burning 1,500+ in error-recovery loops.

I tried fixing the docs. Made them more precise, added inline examples, consolidated pages. The problem kept resurfacing on different fields. Every time I fixed one, another appeared.

The insight: the issue wasn't documentation quality. It was that raw APIs force agents to assemble structures without type safety, and LLMs are fundamentally bad at this.

The fix:

# Before: agent assembles raw JSON in curl
curl -X POST /graphql -d '{"query":"mutation { uploadAsset(input: { shotId: \"...\", type: \"start_frame\", provenance: { method: \"ai_generated\", model: \"gpt-image-2\", prompt: \"...\" } }) { id } }"}'

# After: typed CLI arguments, zero JSON assembly
python3 nl.py upload <shotId> start_frame frame.png --method ai_generated --model "gpt-image-2" --prompt "Winter city street"

The CLI dispatcher routes 34 commands through typed arguments. The agent doesn't guess field types or assemble nested objects. It passes flags.
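For illustration, here is a minimal sketch of what one such typed command could look like with Python's standard argparse. This is not the actual nl.py: the allowed asset types and methods beyond start_frame and ai_generated, the function names, and the shot_42 ID are assumptions, and the file upload itself is left out.

```python
import argparse
import json

# Hypothetical value sets; only start_frame and ai_generated appear in the post.
ASSET_TYPES = ["start_frame", "end_frame", "reference"]
METHODS = ["ai_generated", "uploaded"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="nl.py")
    sub = parser.add_subparsers(dest="command", required=True)

    upload = sub.add_parser("upload", help="Upload an asset to a shot")
    upload.add_argument("shot_id")
    upload.add_argument("asset_type", choices=ASSET_TYPES)
    upload.add_argument("path")
    upload.add_argument("--method", choices=METHODS, required=True)
    upload.add_argument("--model")
    upload.add_argument("--prompt")
    return parser


def upload_variables(args: argparse.Namespace) -> dict:
    """Assemble the nested mutation input so the agent never has to."""
    provenance = {"method": args.method, "model": args.model, "prompt": args.prompt}
    return {
        "input": {
            "shotId": args.shot_id,
            "type": args.asset_type,
            # Drop optional flags the caller did not pass
            "provenance": {k: v for k, v in provenance.items() if v is not None},
        }
    }


# Example invocation with an explicit argv (shot_42 is a made-up ID)
argv = ["upload", "shot_42", "start_frame", "frame.png",
        "--method", "ai_generated", "--model", "gpt-image-2"]
args = build_parser().parse_args(argv)
print(json.dumps(upload_variables(args), indent=2))
```

With `choices`, a mistyped asset type is rejected by argparse with a usage message before any request is sent, so the agent gets a short local error instead of a GraphQL error plus a trip through the docs.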

A bonus: the --json flag gives the agent structured data for reasoning, while the default gives a human-readable table. One CLI, two audiences:

# For the agent: structured JSON for parsing
python3 nl.py overview <noteId> --json

# For the developer watching: readable progress
python3 nl.py overview <noteId>
# Episode 01: The Algorithm Hunter
#   [===done===|--review--|......not_started.......] 3/12
#   Shot   Status       Rolls    Best   PF
#   01A    done         3        48     Y
#   01B    review       2        41     Y
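The two output modes can share one code path. A minimal sketch, assuming the command has already fetched its rows; the render function and the field names are hypothetical, not taken from nl.py:

```python
import json


def render(shots: list[dict], as_json: bool) -> str:
    """Return JSON for the agent or an aligned text table for a human."""
    if as_json:
        return json.dumps({"shots": shots})
    header = f"{'Shot':<6} {'Status':<12} {'Rolls':<6}"
    rows = [f"{s['shot']:<6} {s['status']:<12} {s['rolls']:<6}" for s in shots]
    return "\n".join([header, *rows])


# Hypothetical rows mirroring the table above
shots = [
    {"shot": "01A", "status": "done", "rolls": 3},
    {"shot": "01B", "status": "review", "rolls": 2},
]
print(render(shots, as_json=False))
print(render(shots, as_json=True))
```

Keeping the data identical across both modes means the developer watching the table and the agent parsing the JSON never disagree about state.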

Lesson: Don't make your agent assemble what you can pre-structure. Typed CLI flags are far harder for an LLM to get wrong than hand-assembled nested JSON. If your agent is doing error→doc→retry loops, the fix isn't better docs. It's eliminating the assembly step entirely.

Iteration 3: The pause-and-reflect methodology

CLI fixed execution. But the agent still made bad decisions. It would re-roll a rejected video without changing the prompt first. It would write new prompts without checking what had already been tried. It would skip the preflight check and waste a generation on incomplete assets.

These weren't execution failures. They were judgment failures. The agent executed each action correctly, but chose the wrong actions.

I stopped production and asked the agent:

"Stop. Before we continue, did you have a lot of inefficient actions just now? Which were because the docs or skill spec aren't clear enough? And is there a recurring scenario where a new API that gives you everything at once would've helped?"

The agent pointed to specific gaps in my SKILL.md:

No explicit rule saying "never re-roll without changing something"
No guidance on checking past insights before writing new prompts
Missing decision thresholds (what score means "fix and retry" vs "debug first"?)

I patched those gaps. Ran more production. Stopped again. Asked again. Each cycle surfaced new blind spots in the behavior spec.

This isn't a one-time audit. It's a repeating feedback loop:

produce → reflect → polish spec → produce → reflect → ...

Lesson: Your agent is both the consumer and the best auditor of your behavior spec. It knows exactly where the spec failed it. You just have to stop and ask, then actually fix what it tells you.

Server-pushed guidance

Even after several reflect cycles, one class of failures persisted, and it taught me the most interesting lesson.

The agent wrote a video generation prompt but didn't reference any of the uploaded assets. The generated video had nothing to do with the reference frames sitting right there in the project. The assets existed. The agent just… forgot to use them.

After the failure, I asked: "If there had been a message right before you wrote the prompt listing the available assets and how to reference them, would you have caught this?" The agent said yes.

So I built it. When the server detects that preflight has passed but the active prompt contains no @filename references, it injects a hint listing every available asset:

// Agent wrote a prompt but didn't reference uploaded assets
ctx.pendingHints.push({
  type: "available_refs",
  priority: "high",
  message: `Available refs for prompting: ${refs.map(r => `@${r.filename} (${r.assetType})`).join(", ")}`,
  metadata: { targetId: shot.id, refs },
});
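The trigger condition itself is small. Here it is sketched in Python for consistency with the client snippets, although the server code above is JavaScript. The Ref class and the function name are hypothetical, and the check assumes a reference is written as @ followed by the exact filename:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ref:
    filename: str
    asset_type: str


def available_refs_hint(prompt: str, preflight_passed: bool, refs: list[Ref]) -> Optional[dict]:
    """Return an available_refs hint, or None when no nudge is needed."""
    if not preflight_passed or not refs:
        return None
    if any(f"@{r.filename}" in prompt for r in refs):
        return None  # the prompt already references at least one asset
    listing = ", ".join(f"@{r.filename} ({r.asset_type})" for r in refs)
    return {
        "type": "available_refs",
        "priority": "high",
        "message": f"Available refs for prompting: {listing}",
    }


refs = [Ref("frame.png", "start_frame")]
print(available_refs_hint("Winter city street", True, refs))
print(available_refs_hint("Open on @frame.png, then pan left", True, refs))
```

The first call produces a hint and the second returns None, so the agent only hears from the server when it has actually left assets unused.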

Same method, different failure: the agent uploaded all assets, preflight passed, but it forgot to advance the shot status from asset_prep to ready. Another hint, born from the same question: "would a nudge here have prevented this?"

// Preflight passed but agent forgot to advance status
ctx.pendingHints.push({
  type: "ready_to_advance",
  priority: "high",
  message: "Preflight passed but shot is still in asset_prep. Update status to ready.",
  action: `nl.py shot-update ${shotId} --status ready`,
});

Every high-value hint in the system was designed this way: agent fails → I ask "what hint would have prevented this?" → I build the trigger.

Notice something about these hints: they're not written for me. They're written for the agent. The messages contain CLI commands, @filename conventions, status values, all the agent's working vocabulary. This isn't a notification system for humans. It's the server talking directly to the agent in its own language.

And the best part: even if the agent ignores a hint and makes the mistake anyway, the hint is already sitting in its context window. When the agent enters debug mode after the failure, it naturally recalls "there was a hint about this." The hint doesn't need to be obeyed to be useful. It just needs to exist in context.

Hints travel as extensions.agentHints in every GraphQL response. On the client, they route to stderr so they don't contaminate the JSON the agent is parsing:

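import sys

# Print server hints to stderr so stdout stays clean JSON for the agent
marks = {"high": "!", "medium": "*", "low": "~"}
for h in result.get("extensions", {}).get("agentHints", []):
    # Fall back to "?" rather than crash on an unexpected priority
    mark = marks.get(h.get("priority"), "?")
    print(f"  [{mark}] {h['message']}", file=sys.stderr)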
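To see why the stream split matters, here is a self-contained sketch that captures both streams and confirms stdout still parses as JSON. The emit function and the response payload are hypothetical; only the extensions.agentHints location comes from the setup described above.

```python
import io
import json
import sys
from contextlib import redirect_stderr, redirect_stdout

MARKS = {"high": "!", "medium": "*", "low": "~"}


def emit(result: dict) -> None:
    """Write hints to stderr and the data payload to stdout as JSON."""
    for hint in result.get("extensions", {}).get("agentHints", []):
        mark = MARKS.get(hint.get("priority"), "?")
        print(f"  [{mark}] {hint['message']}", file=sys.stderr)
    print(json.dumps(result.get("data")))


# Hypothetical GraphQL response
response = {
    "data": {"shot": {"id": "shot_42", "status": "asset_prep"}},
    "extensions": {
        "agentHints": [
            {"priority": "high", "message": "Preflight passed but shot is still in asset_prep."}
        ]
    },
}

out, err = io.StringIO(), io.StringIO()
with redirect_stdout(out), redirect_stderr(err):
    emit(response)

parsed = json.loads(out.getvalue())  # stdout contains only JSON
print(parsed["shot"]["status"])
print(err.getvalue(), end="")  # the hint arrived separately
```

Anything piping the command's stdout into a JSON parser is unaffected by hints, while an agent reading the full terminal output still sees them.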

There's also a pool of probabilistic hints for general best practices, but the high-value ones are all reverse-engineered from specific agent failures. They cost no extra database queries, because they're generated from data the mutation has already loaded.

Lesson: Design hints backwards: from failure to trigger, not from architecture to feature. And write them for the agent, not for yourself.

The Architecture That Emerged

Three layers, each discovered through a different failure mode:

Layer	Solves	Agent freedom
SKILL.md	Wrong decisions	High: what to do, when, why
CLI scripts	Assembly errors	Low: exact operations, zero guessing
agentHints	Forgotten context	Reactive: server speaks when relevant

Remove any one layer and the agent drifts. A spec without hints means forgotten context. Hints without a spec mean no decision framework. A CLI without either means correct execution of wrong decisions.

I'm building Narrative Lion, a research tool for content creators that turns the videos you study into a Playbook your AI can actually use. The agent architecture described here runs its production pipeline. Check it out at narrativelion.com

Curious how others handle this: do you push guidance from the server, or keep everything in the behavior spec?