Nazar Boyko

Posted on Jun 27

Inside An AI Agent: Planning, Tool Use, Memory, Constraints, And Verification

#ai #agents #planning #tools

Have you noticed how every demo of "an AI agent" looks impressive in the video and falls apart the moment you ask a sharper question?

The agent confidently does the wrong thing. It forgets what it just decided. It tries to call a tool that doesn't exist. It loops forever rewriting the same file. It calmly tells you the deployment succeeded when it didn't.

These aren't failures of the model. They're failures of the workflow around the model.

Because that's all an agent really is: a software workflow where a language model can pick the next step and call tools. The "intelligence" sits in the prompt and the orchestration around it, not in some secret agent-flavoured fairy dust. Strip the word "agent" away and you've got five pieces of plumbing: planning, tool use, memory, constraints, verification. Every production-grade agent stands or falls on those five.

This is a long walk through each one. Not the marketing version. The kind of detail you actually need before you ship something that talks to your database.

The Loop You're Actually Building

Before we touch any pillar individually, hold the whole loop in your head.

A useful agent does roughly this on every turn:

1. Read the goal (and whatever memory is relevant to it).
2. Decide the next action: answer directly, call a tool, ask a clarifying question, or stop.
3. If it called a tool, observe the tool's result and feed it back in.
4. Update memory if anything is worth remembering.
5. Check constraints: are we over budget, out of iterations, touching something off-limits?
6. Verify the output before declaring success.
7. Loop until done or stopped.

That's it. Every framework (LangGraph, OpenAI Agents SDK, Claude Agent SDK, smolagents, whatever ships next month) is a different shape of the same loop with different defaults.

agent-loop.ts

async function runAgent(goal: string, ctx: AgentContext) {
  const state = ctx.startState(goal);

  for (let step = 0; step < ctx.maxSteps; step++) {
    const decision = await ctx.model.decide(state);

    if (decision.kind === "final") {
      const verified = await ctx.verifier.check(decision.output, state);
      if (verified.ok) return verified.output;
      state.observations.push(verified.feedback);
      continue;
    }

    if (decision.kind === "tool") {
      ctx.guard.assertAllowed(decision.toolName, decision.args);
      const result = await ctx.tools.run(decision.toolName, decision.args);
      state.observations.push(result);
      state.memory.maybeStore(decision, result);
    }
  }

  throw new Error("agent: max steps exceeded");
}

Look at that loop carefully. Every interesting bug in agent systems lives in one of those five method calls: decide, check, assertAllowed, run, maybeStore. The rest is bookkeeping.

Now let's pull each one apart.

Planning: The Step Before The Step

The single biggest difference between a one-shot prompt and an agent is that an agent thinks about what to do before it does it.

A naive setup looks like this:

const reply = await model.complete(`User wants: ${goal}. Do it.`);

The model sees the goal, jumps straight to action, and you're trusting its first instinct on a task that might need five steps. For trivial tasks this is fine. For anything multi-step it falls apart: the model picks a tool, gets a confusing result, panics, and starts hallucinating progress.

A planning step changes the game:

planning.ts

const plan = await model.complete({
  system: PLANNER_SYSTEM,
  user: `Goal: ${goal}\n\nProduce a short numbered plan. Each step must be either a tool call (name + args) or a direct answer. Do not execute anything yet.`,
});

for (const step of parsePlan(plan)) {
  if (step.kind === "tool") {
    const result = await tools.run(step.name, step.args);
    state.observations.push(result);
  }
}

You're asking the model to commit to a plan before it touches anything. The plan becomes auditable. You can show it to the user, log it, even let a different model review it. When something goes wrong, you have a record of what the agent intended versus what it did.

Plan-Then-Execute Versus ReAct

There are two dominant planning styles, and they have very different ergonomics.

Plan-then-execute is what we just wrote: the model produces a full plan up front, then a runner steps through it. Clean to debug, easy to log, hard to recover from when reality differs from the plan. The model didn't know the file would be 500MB. It didn't know the API would return a different schema. The plan is now wrong and the runner doesn't know how to adapt.

ReAct (reason + act) interleaves thinking and acting. On every turn the model writes a short rationale, picks one tool call, observes the result, then writes the next rationale. The model can adjust as it learns. You pay for it in tokens and latency (every turn pays the full context cost), but the agent stays honest about reality.

react_loop.py

def react_step(state):
    response = model.complete(
        system=REACT_SYSTEM,
        messages=state.messages,
    )
    thought, action = parse_react(response)
    state.log(thought)

    if action.kind == "final":
        return action.value

    observation = tools.run(action.name, action.args)
    state.messages.append({"role": "assistant", "content": response})
    state.messages.append({"role": "user", "content": f"Observation: {observation}"})
    return None

You're not picking one style forever. A lot of useful agents do plan-then-execute with a re-plan trigger: the model writes a plan, the runner executes until it hits a surprise, then the runner asks the model for a new plan from the current state. Cheaper than pure ReAct, more adaptive than pure plan-then-execute.
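A minimal sketch of that re-plan trigger, assuming a planner callable and a tool runner that reports success or failure. The names `Step`, `Result`, `run_with_replan`, `fake_planner`, and `fake_tool` are hypothetical stand-ins for a model call and a real tool bus, not part of any framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    tool: str
    args: dict


@dataclass
class Result:
    ok: bool
    value: str


def run_with_replan(
    goal: str,
    make_plan: Callable[[str, list], list],
    run_tool: Callable[[Step], Result],
    max_replans: int = 2,
) -> list:
    """Execute a plan step by step; ask for a fresh plan when a step fails."""
    observations: list = []
    plan = make_plan(goal, observations)
    replans = 0
    while plan:
        step = plan.pop(0)
        result = run_tool(step)
        observations.append(f"{step.tool}: {result.value}")
        if not result.ok:
            if replans >= max_replans:
                raise RuntimeError("re-plan budget exhausted")
            replans += 1
            # Surprise: drop the rest of the plan and re-plan from what we know now.
            plan = make_plan(goal, observations)
    return observations


# Stub planner: switches strategy once it has seen a "too large" observation.
def fake_planner(goal: str, observations: list) -> list:
    if any("too large" in o for o in observations):
        return [Step("read_head", {"path": "big.log", "lines": 100})]
    return [Step("read_file", {"path": "big.log"}), Step("summarize", {})]


# Stub tool runner: reading the whole file fails, everything else succeeds.
def fake_tool(step: Step) -> Result:
    if step.tool == "read_file":
        return Result(False, "file too large")
    return Result(True, "ok")


log = run_with_replan("summarize big.log", fake_planner, fake_tool)
```

The runner only pays for a new planning call when something goes wrong, and the re-plan cap keeps a stubborn failure from turning into an endless planning loop.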

What A Good Plan Actually Looks Like

A common failure here is letting the model write plans that are too abstract.

1. Understand the user's request.
2. Gather relevant information.
3. Provide a helpful response.

That plan is useless. It's the agent equivalent of a meeting agenda that says "discuss things". A useful plan names tools and arguments:

1. Call list_files on ./src/api.
2. For each file matching *_handler.go, call read_file.
3. Search for the string "github.com/old/dep" across results.
4. If matches found, call propose_patch per file.
5. Run go test ./... to verify nothing broke.
6. Return summary with file count and test status.

You enforce this shape with the system prompt and with examples. A line like "Each step must reference a tool from the tool list and include concrete arguments. Steps that say 'understand' or 'analyze' will be rejected." does more work than people expect.
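The prompt does most of the work, but you can also reject vague plans in code before the runner touches them. Here is one possible check, assuming the plan has already been parsed into dicts with `tool` and `args` keys. That shape and the `validate_plan` name are assumptions for illustration.

```python
def validate_plan(steps: list, tool_names: set) -> list:
    """Return a list of problems; an empty list means the plan is concrete enough to run."""
    errors = []
    for i, step in enumerate(steps, start=1):
        tool = step.get("tool")
        args = step.get("args")
        if tool not in tool_names:
            # Catches "understand the request" style steps that name no real tool.
            errors.append(f"step {i}: unknown or missing tool {tool!r}")
        elif not isinstance(args, dict) or not args:
            errors.append(f"step {i}: {tool} needs concrete arguments")
    return errors


tools = {"list_files", "read_file", "propose_patch"}

good_plan = [{"tool": "list_files", "args": {"path": "./src/api"}}]
bad_plan = [
    {"description": "Understand the user's request."},
    {"tool": "read_file", "args": {}},
]

good_errors = validate_plan(good_plan, tools)
bad_errors = validate_plan(bad_plan, tools)
```

Feed `bad_errors` back to the planner as an observation and ask for a corrected plan, the same way you would with a rejected tool call.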

Tool Use: Where The Agent Actually Touches The World

Without tools, an agent is a chatbot. With tools, it can do real things: read files, hit APIs, query databases, send messages, run commands. This is where every interesting capability comes from, and where the most dangerous failures happen.

A tool, mechanically, is three things: a name, a JSON schema for its arguments, and a function the runtime calls when the model picks it.

tool-definition.ts

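import { promises as fs } from "node:fs";
import { resolve, sep } from "node:path";

// Assumption: the project root is the working directory (absolute and normalized).
const projectRoot = process.cwd();

const readFile = {
  name: "read_file",
  description:
    "Read a file from the project. Use to inspect code or config. " +
    "Do not use for binary files or anything larger than 256KB.",
  parameters: {
    type: "object",
    properties: {
      path: {
        type: "string",
        description: "Path relative to the project root. No leading slash.",
      },
    },
    required: ["path"],
  },
  handler: async ({ path }: { path: string }) => {
    if (path.startsWith("/")) throw new Error("absolute paths not allowed");
    // Resolve before reading: the leading-slash check alone lets "../" escape the root.
    const fullPath = resolve(projectRoot, path);
    if (!fullPath.startsWith(projectRoot + sep)) {
      throw new Error("path escapes project root");
    }
    return await fs.readFile(fullPath, "utf8");
  },
};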

Four things are doing the heavy lifting here, and three of them are not code.

The Description Is The Prompt

Models pick tools based on the description, not the name. A tool called read_file with a vague description gets called for "find the user's email" because the model thinks "well, the email is probably in a file somewhere." A description that says "Read a single file when you already know its path. Do not use this for searching. Use grep_repo for that." will save you a hundred wrong tool calls.

Treat tool descriptions like little spec sheets. List what the tool is for, what it isn't for, the shape of valid input, the shape of valid output, and any edge cases the model needs to know.

Schemas Aren't Suggestions

The JSON schema is your only contract. If the model invents an argument that isn't in the schema, your validator should reject the call before it reaches the handler. If a required field is missing, same. If a string is supposed to be one of ["read", "write", "delete"] and the model sends "REMOVE", reject.

Models are good but they freelance. Rejecting bad tool calls and feeding the error back to the model is better than accepting them: the model learns mid-loop and adjusts.

function validateToolCall(call: ToolCall, schema: JSONSchema) {
  const result = ajv.validate(schema, call.args);
  if (!result) {
    return {
      ok: false,
      feedback: `Tool call rejected. Errors: ${ajv.errorsText()}. Retry with valid arguments.`,
    };
  }
  return { ok: true };
}

Side Effects Need A Different Class Of Tool

There's a category boundary that frameworks often blur but you shouldn't: read tools and write tools are different animals.

Read tools are cheap to retry. If list_files returns nothing, you call it again with different args. No harm done.

Write tools (apply_patch, send_email, deploy_service, run_sql) are expensive to undo, sometimes impossible. These deserve their own permission tier, their own logging, often their own approval step. We'll come back to this under constraints, but design tools knowing which side they sit on.
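One way to make that boundary explicit is to tag every tool with its side-effect class and let the runner treat the two classes differently. This is a sketch under assumptions: the `Effect` enum, the `Tool` dataclass, and `run_tool` are hypothetical names, and the two example tools are stubs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Effect(Enum):
    READ = "read"    # safe to retry
    WRITE = "write"  # changes the world; needs approval and an audit record


@dataclass(frozen=True)
class Tool:
    name: str
    effect: Effect
    handler: Callable[[dict], object]


def run_tool(tool: Tool, args: dict, *, approved: bool = False,
             audit: Optional[list] = None):
    """Run read tools directly; refuse write tools unless approved, and log them."""
    if tool.effect is Effect.WRITE:
        if not approved:
            raise PermissionError(f"{tool.name} has side effects and needs approval")
        if audit is not None:
            audit.append((tool.name, args))
    return tool.handler(args)


# Hypothetical tools for illustration only.
list_files = Tool("list_files", Effect.READ, lambda args: ["a.go", "b.go"])
send_email = Tool("send_email", Effect.WRITE, lambda args: f"sent to {args['to']}")

audit_log: list = []
files = run_tool(list_files, {"path": "./src"})
sent = run_tool(send_email, {"to": "ops@example.com"}, approved=True, audit=audit_log)
```

The classification lives on the tool definition, so nobody can add a write tool without deciding which tier it belongs to.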

A Tool Bus, Not A Switch Statement

When you have three tools, a switch statement is fine. When you have thirty, you want a tool registry that does schema validation, logging, timeout enforcement, and side-effect classification in one place.

tool_bus.py

class ToolBus:
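    def __init__(self, metrics, log):
        self.tools: dict[str, Tool] = {}
        # Assumed collaborators: metrics.time(name) returns an async context
        # manager, and log is a callable taking (caller, tool_name, args, value).
        self.metrics = metrics
        self.log = log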

    def register(self, tool: Tool):
        self.tools[tool.name] = tool

    async def run(self, name: str, args: dict, *, caller: AgentId):
        tool = self.tools.get(name)
        if tool is None:
            return ToolResult.error(f"unknown tool: {name}")

        valid, err = tool.validate(args)
        if not valid:
            return ToolResult.error(f"invalid args: {err}")

        async with self.metrics.time(tool.name):
            try:
                value = await asyncio.wait_for(
                    tool.handler(args), timeout=tool.timeout_s
                )
                self.log(caller, tool.name, args, value)
                return ToolResult.ok(value)
            except asyncio.TimeoutError:
                return ToolResult.error(f"tool timed out after {tool.timeout_s}s")
            except Exception as exc:
                return ToolResult.error(f"tool failed: {exc}")

This single class is where you'll later add rate limits, audit trails, dry-run mode, and cost tracking. Build it on day one, even if it feels like overkill: the alternative is bolting these concerns onto a sprawl of one-off tool handlers later, which is much worse.

Memory: The Word That Hides Three Different Things

"Memory" is the most overloaded word in the agent vocabulary. It usually means at least three different mechanisms stitched together, and conflating them is a leading cause of "why did the agent forget what I just told it?" bugs.

Working Memory (The Context Window)

This is the conversation so far, plus tool results, plus the system prompt. It lives in the model's context window and disappears the moment the request returns. It's bounded by the model's context length and your wallet.

Most "the agent forgot" complaints are about working memory. You ran two separate requests, the second one didn't include the relevant history, the model genuinely has no idea what you're talking about. The fix isn't a vector database. The fix is including the history.
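"Including the history" can be as plain as a wrapper that resends every prior message with each request. A minimal sketch, assuming a `complete` callable that takes a list of role/content messages and returns a string. The `Conversation` class and `fake_model` stub are hypothetical, not a specific SDK's API.

```python
from typing import Callable


class Conversation:
    """Keeps working memory alive by resending the full history on every request."""

    def __init__(self, system: str, complete: Callable[[list], str]):
        self.system = system
        self.complete = complete
        self.history: list = []

    def send(self, text: str) -> str:
        self.history.append({"role": "user", "content": text})
        # The model only knows what is in this list; nothing carries over by itself.
        messages = [{"role": "system", "content": self.system}, *self.history]
        reply = self.complete(messages)
        self.history.append({"role": "assistant", "content": reply})
        return reply


# Stub model that reports how much context it was given.
def fake_model(messages: list) -> str:
    return f"I can see {len(messages)} messages"


chat = Conversation("You are a helpful agent.", fake_model)
first = chat.send("My service is called billing-api.")
second = chat.send("What is my service called?")
```

On the second call the model receives the system prompt, the first exchange, and the new question. Drop the history and it sees only two messages, which is exactly the "agent forgot" bug.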

Scratchpad Memory (Within A Run)

Inside a single agent run, the model often benefits from a place to "write notes to itself." This is just structured working memory: a list of observations, intermediate results, decisions and their reasoning.

scratchpad.ts

type ScratchpadEntry =
  | { kind: "observation"; toolName: string; result: unknown }
  | { kind: "decision"; rationale: string; choice: string }
  | { kind: "note"; text: string };

class Scratchpad {
  entries: ScratchpadEntry[] = [];

  add(entry: ScratchpadEntry) {
    this.entries.push(entry);
  }

  render(maxTokens: number): string {
    return formatRecent(this.entries, maxTokens);
  }
}

The scratchpad is what you feed back into the model on the next turn. It's not magic. It's a structured replay of the agent's own work. The trick is keeping it short enough to fit. A scratchpad that just appends forever is how agents lose their minds on long tasks.
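Here is one way a helper like `formatRecent` could work, sketched in Python: walk the entries newest-first and stop when the budget is spent. The whitespace-based token count is a loud assumption; swap in your provider's tokenizer for real accounting.

```python
def format_recent(entries: list, max_tokens: int,
                  count_tokens=lambda s: len(s.split())) -> str:
    """Keep the newest entries that fit in the budget and note how many were dropped."""
    kept = []
    used = 0
    for entry in reversed(entries):
        cost = count_tokens(entry)
        if used + cost > max_tokens:
            break
        kept.append(entry)
        used += cost
    lines = list(reversed(kept))  # restore chronological order
    dropped = len(entries) - len(kept)
    if dropped:
        # The marker itself is not counted against the budget in this sketch.
        lines.insert(0, f"[{dropped} earlier entries omitted]")
    return "\n".join(lines)


rendered = format_recent(["a b c", "d e", "f g h i"], max_tokens=6)
```

Telling the model that entries were omitted matters: it knows its view is partial instead of assuming nothing happened earlier. A natural next step is replacing the dropped entries with a short summary.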

Long-Term Memory (Across Runs)

This is what people usually mean when they say "memory": a store of facts the agent can recall in future conversations. User preferences, project context, the result of expensive computations, lessons from past failures.

There are three popular shapes:

Shape	Looks like	Good for	Bad at
Key/value	Redis or a flat file	Stable facts (user role, preferred language)	Anything fuzzy or semantic
Vector store	Pinecone, pgvector, Chroma	Semantic recall over notes/docs	Exact matches, freshness, contradictions
File-based	A memory/ directory of markdown files	Auditable, editable, structured	Scale beyond a few thousand entries

File-based memory is underrated. Claude Code and a few other agent tools use exactly this: a directory of markdown files, indexed by a small MEMORY.md. The agent reads, writes, and edits files. There's nothing to migrate, you can git diff it, and the user can delete a memory by deleting a file. It scales worse than vectors but it's vastly easier to reason about, and the failure mode is "we couldn't find the right file" rather than "we semantically retrieved the wrong fact and the agent confidently used it."
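To make the shape concrete, here is a small sketch of a file-based store: one markdown file per memory plus a regenerated index. The `FileMemory` class, the slug-based filenames, and the index format are assumptions for illustration, not how any particular tool lays out its files.

```python
import tempfile
from datetime import date
from pathlib import Path


class FileMemory:
    """One markdown file per memory, plus a MEMORY.md index rebuilt on every change."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.index = self.root / "MEMORY.md"

    def remember(self, slug: str, title: str, body: str) -> Path:
        path = self.root / f"{slug}.md"
        text = f"# {title}\n\nRecorded: {date.today().isoformat()}\n\n{body}\n"
        path.write_text(text, encoding="utf-8")
        self._reindex()
        return path

    def forget(self, slug: str) -> None:
        # Deleting a memory is deleting a file.
        (self.root / f"{slug}.md").unlink(missing_ok=True)
        self._reindex()

    def _reindex(self) -> None:
        lines = ["# Memory index", ""]
        for path in sorted(self.root.glob("*.md")):
            if path == self.index:
                continue
            title = path.read_text(encoding="utf-8").splitlines()[0].lstrip("# ")
            lines.append(f"- [{title}]({path.name})")
        self.index.write_text("\n".join(lines) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    memory = FileMemory(Path(tmp) / "memory")
    memory.remember("package-manager", "Use pnpm, not npm",
                    "Legacy package-lock.json exists but is unused.")
    index_after_write = memory.index.read_text(encoding="utf-8")
    memory.forget("package-manager")
    index_after_delete = memory.index.read_text(encoding="utf-8")
```

Everything here is inspectable with `cat`, `grep`, and `git diff`, which is the whole appeal.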

Memory As A Tool, Not A Background Process

The cleanest design decision in this whole space: memory is just two more tools. recall(query) and remember(fact). The model decides when to recall and when to remember, the same way it decides when to read a file or send a message.

The alternative, a background process that magically injects "relevant memories" into every prompt, sounds convenient and is actually a nightmare to debug. You'll spend more time explaining why the agent randomly mentioned the user's old API key than you saved by automating retrieval.

When memory is a tool, you can ask the agent to show its work. You ask "Why did you think the user wanted the Go example?" and the agent says "I called recall('user language preference') and got back: 'Prefers Go for backend examples (2026-04-02).'" That's an answerable question, in a way "the embeddings retrieved it" never is.
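A sketch of the two tools with a trace attached, so "why did you think that?" has a concrete answer. The `MemoryTools` class is hypothetical, and the keyword-overlap matching is a deliberately naive stand-in for whatever retrieval you actually use.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryTools:
    facts: list = field(default_factory=list)
    trace: list = field(default_factory=list)  # every call, in order, for auditing

    def remember(self, fact: str) -> str:
        self.facts.append(fact)
        self.trace.append({"tool": "remember", "fact": fact})
        return "stored"

    def recall(self, query: str) -> list:
        # Naive matching: any shared lowercase word counts as a hit.
        words = set(query.lower().split())
        hits = [f for f in self.facts if words & set(f.lower().split())]
        self.trace.append({"tool": "recall", "query": query, "hits": hits})
        return hits


memory = MemoryTools()
memory.remember("Prefers Go for backend examples (2026-04-02)")
hits = memory.recall("backend examples")
misses = memory.recall("rust")
```

Because every recall is an ordinary tool call, the trace shows exactly which query produced which fact, including the queries that came back empty.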

Memory Hygiene

Stale memory is worse than no memory. An agent that remembers your team uses Postgres when you switched to MySQL six months ago is going to produce confidently wrong advice forever.

A few rules that have aged well:

Timestamp every memory. The agent needs to know it might be stale.
Reasons over facts. A memory like "avoid npm install here: pnpm is the package manager (legacy package-lock.json exists but is unused)" beats "use pnpm". The reason lets future agents (or the same one tomorrow) judge edge cases.
Make deletion cheap. If updating a memory requires a vector reindex, it won't happen. If it's a file edit, it will.
Verify before relying. If a memory says a function exists, grep for it before recommending it. Memory is a hypothesis, not a source of truth.
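The first two rules fit in a small record type. In this sketch the `Memory` dataclass, the `render_for_prompt` helper, and the 90-day staleness threshold are all assumptions; pick a threshold that matches how fast your facts change.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Memory:
    text: str
    reason: str      # why the fact holds, so edge cases can be judged later
    recorded: date   # when it was written down


def render_for_prompt(memory: Memory, today: date,
                      stale_after: timedelta = timedelta(days=90)) -> str:
    """Render a memory with its date and reason, flagging it when it is old."""
    age = today - memory.recorded
    line = f"[{memory.recorded.isoformat()}] {memory.text} (why: {memory.reason})"
    if age > stale_after:
        line += f" [possibly stale: {age.days} days old, verify before relying on it]"
    return line


note = Memory("avoid npm install here", "pnpm is the package manager", date(2025, 1, 1))
fresh = render_for_prompt(note, today=date(2025, 2, 1))
stale = render_for_prompt(note, today=date(2025, 7, 1))
```

The staleness flag does not delete anything. It just tells the model to treat the memory as a hypothesis and check it first.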

Constraints: The Difference Between A Demo And A System

Agents in demos are unconstrained. They have full filesystem access, can call any tool, run any command, spend any number of tokens, take any number of turns. The demo works because the demonstrator is watching every step.

Production agents are not watched. The constraints are what let you sleep.

Permissions Are A Policy Object, Not An If Statement

The pattern I've seen survive contact with reality is treating permissions as a separate first-class object. The agent core calls guard.assertAllowed(toolName, args) before every tool call, and the guard says yes or no based on a policy that you can read in one place.

permissions.ts

type Policy = {
  allowedTools: string[];
  pathAllowlist: RegExp[];
  pathDenylist: RegExp[];
  requireApprovalFor: string[];
  maxToolCallsPerRun: number;
  maxTokensPerRun: number;
};

class Guard {
  constructor(private policy: Policy, private state: RunState) {}

  assertAllowed(name: string, args: Record<string, unknown>) {
    if (!this.policy.allowedTools.includes(name)) {
      throw new GuardError(`tool not in allowlist: ${name}`);
    }
    if (this.state.toolCalls >= this.policy.maxToolCallsPerRun) {
      throw new GuardError("max tool calls exceeded");
    }
    if (typeof args.path === "string") {
      const path = args.path;
      if (this.policy.pathDenylist.some((re) => re.test(path))) {
        throw new GuardError(`path on denylist: ${path}`);
      }
      if (!this.policy.pathAllowlist.some((re) => re.test(path))) {
        throw new GuardError(`path not in allowlist: ${path}`);
      }
    }
    if (this.policy.requireApprovalFor.includes(name)) {
      throw new ApprovalRequired(name, args);
    }
  }
}

This is unglamorous code and it does more for your safety story than any clever prompt.

The Four Constraints Every Agent Needs

Almost every long-lived agent system converges on the same four:

1. Tool allowlist. The agent can call only these named tools. Anything else is rejected before the handler runs. This stops "I'll just write a delete_everything tool real quick" patterns and tightens the surface area enormously.
2. Iteration budget. A hard cap on tool calls per run. Agents will absolutely loop forever if you let them: re-reading the same file, retrying a failing API, "thinking more about it." Pick a number based on your task complexity and bail when you hit it. Better to fail loudly than to silently rack up an API bill.
3. Token/cost budget. Independent of iterations, count tokens. Long tool outputs eat budget faster than you'd expect. When you hit the cap, the agent stops and reports.
4. Approval gate for side effects. Any tool that changes the world outside the agent's sandbox (sends an email, hits prod, files a PR, charges a card) goes through a separate ApprovalRequired path. The agent proposes; a human (or a stricter automated check) disposes.
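The iteration and token budgets can share one small counter object. A sketch, assuming a rough four-characters-per-token estimate (an approximation, not a real tokenizer); `RunBudget`, `BudgetExceeded`, and `estimate_tokens` are hypothetical names.

```python
class BudgetExceeded(Exception):
    """Raised when a run goes over its tool-call or token budget."""


class RunBudget:
    def __init__(self, max_tool_calls: int, max_tokens: int):
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.tool_calls = 0
        self.tokens = 0

    def charge(self, tokens: int, *, tool_call: bool = False) -> None:
        """Record usage and fail loudly the moment either cap is crossed."""
        self.tokens += tokens
        if tool_call:
            self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool calls: {self.tool_calls} > {self.max_tool_calls}")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"tokens: {self.tokens} > {self.max_tokens}")


def estimate_tokens(text: str) -> int:
    # Assumption: roughly four characters per token.
    return max(1, len(text) // 4)


budget = RunBudget(max_tool_calls=3, max_tokens=100)
big_output = "x" * 200  # a 200-character tool result, about 50 tokens

budget.charge(estimate_tokens(big_output), tool_call=True)  # 50 tokens used
budget.charge(estimate_tokens(big_output), tool_call=True)  # 100 tokens used
try:
    budget.charge(estimate_tokens(big_output), tool_call=True)
    stop_reason = None
except BudgetExceeded as exc:
    stop_reason = str(exc)
```

Note that the token cap trips on the third call even though the iteration cap would have allowed it, which is why the two budgets are counted independently.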

The pattern for the fourth one is worth lingering on, because it's where teams over-engineer the most.

Approval Gates Without Building A Workflow Engine

You don't need a fancy approval workflow. You need a way to pause the agent, surface what it wants to do, and let a human respond.

approval.py

class Approval:
    async def request(self, action: str, args: dict, *, justification: str):
        ticket = await self.store.create(
            action=action, args=args, justification=justification, status="pending"
        )
        await self.notifier.send(
            channel="approvals",
            text=f"Agent wants to {action} with {args}. Why: {justification}",
            ticket_id=ticket.id,
        )
        return ticket

That's the whole approval system. Ticket in a database, message in Slack (or email, or wherever your humans live), a way to look up the result. The agent calls it, the run pauses (or returns a "waiting for approval" status), a human clicks yes or no. You can add SLAs, escalation, batching later, but the simple shape ships in a day and covers 90% of what you need.
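The other half is what happens when the run resumes. Here is a sketch of that side, using an in-memory stand-in for the ticket store. `InMemoryTickets`, `resume`, and the status strings are assumptions for illustration; a real store would be a database table.

```python
import asyncio
from dataclasses import dataclass
from itertools import count


@dataclass
class Ticket:
    id: int
    action: str
    args: dict
    status: str = "pending"  # "pending", "approved", or "rejected"


class InMemoryTickets:
    """Stand-in for the ticket table."""

    def __init__(self):
        self._ids = count(1)
        self.tickets: dict = {}

    async def create(self, action: str, args: dict) -> Ticket:
        ticket = Ticket(next(self._ids), action, args)
        self.tickets[ticket.id] = ticket
        return ticket

    async def decide(self, ticket_id: int, approved: bool) -> None:
        self.tickets[ticket_id].status = "approved" if approved else "rejected"


async def resume(store: InMemoryTickets, ticket_id: int, handler) -> dict:
    """Act only on an approved ticket; otherwise report where things stand."""
    ticket = store.tickets[ticket_id]
    if ticket.status == "pending":
        return {"status": "waiting_for_approval", "ticket_id": ticket.id}
    if ticket.status == "rejected":
        return {"status": "rejected", "ticket_id": ticket.id}
    return {"status": "done", "result": await handler(ticket.args)}


async def demo() -> list:
    store = InMemoryTickets()

    async def deploy(args: dict) -> str:  # hypothetical write tool
        return f"deployed {args['service']}"

    ticket = await store.create("deploy_service", {"service": "api"})
    before = await resume(store, ticket.id, deploy)
    await store.decide(ticket.id, approved=True)  # the human clicks yes
    after = await resume(store, ticket.id, deploy)
    return [before, after]


results = asyncio.run(demo())
```

The handler never runs while the ticket is pending or rejected, so the approval state in the store is the single thing that gates the side effect.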

Verification: Where The Trust Actually Comes From

The model is confident by default. It will say the code works. It will say the deployment succeeded. It will say the test passed. None of those statements should be trusted on their own.

Verification is the difference between an agent that claims to have done the work and an agent that can prove it did. Every serious production agent has a verification step somewhere: sometimes it's external (run the tests), sometimes it's another agent (an independent reviewer), sometimes it's the same agent checking its own work against an explicit rubric.

External Verification: The Bar Is Higher Than You Think

For coding agents, this is almost always running a real tool against the real artifact.

verify-code-change.ts

async function verifyCodeChange(change: ProposedChange) {
  await applyToWorktree(change);

  const typeCheck = await run("tsc", ["--noEmit"]);
  if (typeCheck.exitCode !== 0) return failure("type check failed", typeCheck);

  const lint = await run("npm", ["run", "lint"]);
  if (lint.exitCode !== 0) return failure("lint failed", lint);

  const tests = await run("npm", ["test", "--", "--run"]);
  if (tests.exitCode !== 0) return failure("tests failed", tests);

  return success();
}

This is unglamorous and it's the single best lever you have for agent reliability. If the agent claims it fixed a bug, run the test. If the agent claims it refactored something safely, run the typechecker. The model is allowed to lie. The compiler isn't.

For non-coding agents, the analog is whatever your domain has: a schema check on the JSON output, a regex on a date format, an actual API call to confirm the resource exists, a SQL query to verify the row was inserted.
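As an example of the cheap end of that list, here is a sketch that checks a model's JSON output for required fields and a date format. The invoice fields are made up for illustration; the point is that the check is plain code, not another model call.

```python
import json
import re

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

# Hypothetical output contract: field name -> accepted Python type(s).
REQUIRED = {"invoice_id": str, "total": (int, float), "due_date": str}


def verify_output(raw: str) -> list:
    """Return a list of problems with the output; an empty list means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["expected a JSON object"]

    problems = []
    for name, kind in REQUIRED.items():
        if not isinstance(data.get(name), kind):
            problems.append(f"{name}: missing or wrong type")
    due = data.get("due_date")
    # The regex checks the shape only, not whether the date exists on a calendar.
    if isinstance(due, str) and not ISO_DATE.fullmatch(due):
        problems.append(f"due_date: expected YYYY-MM-DD, got {due!r}")
    return problems


good = verify_output('{"invoice_id": "INV-1", "total": 42.5, "due_date": "2026-07-01"}')
bad = verify_output('{"invoice_id": "INV-1", "total": "42.5", "due_date": "July 1"}')
```

The problem strings are written to be fed straight back to the model as observations.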

Self-Critique: Cheaper Than People Think, Less Magic Than People Hope

The simplest self-critique is one extra model call: "Here is what you produced. Here is the rubric. List every place the output fails the rubric. If none, say 'OK'."

self_critique.py

async def critique(output: str, rubric: str) -> CritiqueResult:
    response = await model.complete(
        system=CRITIC_SYSTEM,
        user=f"Output:\n{output}\n\nRubric:\n{rubric}\n\nList violations or say OK.",
    )
    # Tolerate trivial variations such as "OK." or "ok" before parsing violations.
    if response.strip().rstrip(".").upper() == "OK":
        return CritiqueResult.ok()
    return CritiqueResult.violations(parse(response))

This works better than you'd expect, and worse than people pretend. It catches the obvious stuff: missing fields, factual contradictions, style violations the rubric explicitly names. It misses subtle reasoning errors because the same model that made the error is now grading it. Useful, not sufficient.

Independent Reviewer Agents

When the cost is justified, the pattern is to have a different agent (or at least a different prompt with a different role) review the output. The reviewer doesn't see the original chain of thought, only the final artifact and the original goal. It's much closer to a human code review and catches a different class of mistake.

This is also where the "judge agent" pattern lives. The judge has a strict rubric, refuses to be polite, and returns a structured verdict. You don't ship the output until the judge approves.
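A structured verdict is only useful if you parse it strictly. One possible sketch, assuming the judge is asked to reply with JSON containing a boolean `approved` and a list of `issues`. That reply format and the `parse_verdict` name are assumptions, and anything malformed counts as a rejection.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    approved: bool
    issues: list


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's reply, failing closed on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Verdict(False, ["judge reply was not valid JSON"])
    if not isinstance(data, dict) or not isinstance(data.get("approved"), bool):
        return Verdict(False, ["judge reply has no boolean 'approved' field"])
    issues = data.get("issues", [])
    if not isinstance(issues, list):
        return Verdict(False, ["judge reply has a malformed 'issues' field"])
    issues = [str(item) for item in issues]
    # An approval that still lists issues is contradictory; treat it as a rejection.
    return Verdict(approved=data["approved"] and not issues, issues=issues)


clean = parse_verdict('{"approved": true, "issues": []}')
chatty = parse_verdict("Looks great to me!")
mixed = parse_verdict('{"approved": true, "issues": ["no tests for the new branch"]}')
```

Failing closed means a polite, unstructured "looks great" from the judge blocks the output instead of waving it through.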

Verification Loops, Not Verification Steps

The most underrated move is making verification part of the loop, not a final gate. If verifyCodeChange fails with "test X failed: expected 200, got 500", you feed that observation back to the model and let it try again. Same with critique violations. Same with judge rejections.

verify-and-retry.ts

for (let attempt = 0; attempt < ctx.maxVerifyAttempts; attempt++) {
  const output = await ctx.agent.produce();
  const verdict = await ctx.verifier.check(output);
  if (verdict.ok) return output;
  ctx.state.observations.push({
    kind: "verification_failed",
    detail: verdict.feedback,
  });
}
throw new Error("verification failed after retries");

The model that ignored the test on attempt 1 sees the actual error on attempt 2 and usually fixes it. That's not magic. It's just letting the model see what went wrong, which an agent without a verifier never gets the chance to do.

Putting The Pillars Together

If you skim the pillars individually they look like five separate features. They're not. They're five views of the same loop, and the interesting design choices are about how they interact.

A plan that ignores constraints is a plan the agent can't execute. A tool registry without verification produces actions you can't audit. Memory without hygiene corrupts every future plan. Verification without retries is a wall; verification inside a loop is a teacher. Constraints without observability are a black box that fails silently in production.

The teams whose agents work in production have all stopped chasing "smarter prompts" and started shipping plumbing. Better tool descriptions. Tighter schemas. A real permissions object. An honest budget. A verifier that actually runs the tests. A memory tier that the user can grep.

None of that is sexy. It's all just software engineering. Which is exactly the point: once you stop expecting magic, the work becomes legible, the failures become diagnosable, and the agent stops being a mysterious black box and starts being a system you maintain like any other.

The model gets to be the clever part. Everything around it is your job, and that's where the difference between a demo and a product really lives.

Top comments (6)

Golen • Jun 29

Really enjoyed this article! It does a great job explaining how AI agents combine planning, memory, and verification in a way that's easy to follow. Thanks for sharing such a clear and insightful breakdown!

Nazar Boyko • Jun 29

Thanks, glad it resonated! The verification loop is the part I find most underrated, easy to skip but it changes everything in production.

Mudassir Khan • Jul 1

the 'memory is just two tools' section is the one i keep coming back to. we had a background retrieval system that silently injected 'relevant' context and spent two sprints debugging why the agent kept recommending a deprecated endpoint. turned out the vector search was surfacing an old doc at high similarity. impossible to trace without reproducing the exact query.

switched to explicit recall/remember tool calls and suddenly the agent's reasoning became auditable. you can ask it why it made a decision and it can actually show you.

the file based memory point is underrated too. we use it on a few internal tools and the ability to git diff the memory state after a bad run is genuinely useful. what made you land on file based over vector for the long term memory examples in the post?

Grok • Jun 30

Really enjoyed the focus on the engineering side of agents rather than the hype. One thing I kept thinking about while reading is observability. Planning, memory, constraints, and verification all become much easier to debug if every decision is emitted as structured events (plan created, tool selected, memory recalled, verification failed, etc.). In practice, I've found that good traces often improve an agent more than another prompt iteration because you can finally see why it made a bad decision. I'm curious whether you'd consider observability a sixth pillar, or if you see it as something that should be woven through all five.

Nazar Boyko • Jun 30 • Edited

Great point, and you caught what the piece only hints at in one line near the end. My take: woven through, not a sixth pillar. The events (plan_created, tool_selected, verification_failed) are just the five loop steps emitting spans. The one part that is its own plumbing is the trace store itself, which sits alongside the loop like the ToolBus or Guard. And totally agree: most agent failures are decision failures, and you can't fix a decision you can't see. 🙌

David@Opace • Jul 2

Useful breakdown. I like the framing that most agent failures are workflow failures, not just model failures, especially around constraints and verification.