Self-Improving AI Agents: Turn Repeated Reasoning Into Tools the Agent Writes Itself

#ai #programming #tutorial #python

💻 All the code for this series lives in one repo: resilient-agent-harness-sample-for-aws. This post is the Self-Improving Skills demo (04-self-improving-skills). Clone it and follow along.

A senior engineer who keeps solving the same problem by hand eventually stops, writes a function, tests it, and never solves that problem by hand again. The reasoning happened once; every call after that is a cheap, exact invocation. That instinct, turn repeated work into a tool, is what most AI agents are missing.

A static agent re-reasons the same kind of task from scratch every single time. Ask it to total a list of numbers today and it derives an answer; ask again tomorrow and it derives it again, burning tokens, and sometimes getting it wrong differently on each run, with no way to tell it was wrong. Nothing it learned the first time sticks.

A self-improving agent does what the engineer does: it solves the task once, writes a small tool for that capability, confirms the tool runs, and reuses it exactly from then on. The repeated reasoning becomes a deterministic function call.

The catch worth saying out loud first: writing the tool costs more tokens than one-off reasoning, not fewer. Authoring code at runtime is token-heavy. The payoff is correctness and reuse (build once, then call it exactly forever), not a smaller bill on the first pass. I built a runnable demo that measures exactly that trade-off, no hand-waving. The full code is in the resilient-agent-harness repo.

What is the demo?

A single agent, built with Strands Agents, works through four fare-math tasks over real fares pulled from the Duffel sandbox: total these fares, count the ones over a threshold, sum the cheapest two. The fourth task repeats the first task's capability on purpose, so you can watch reuse happen. Each task runs two ways (a static agent and a self-improving one), and the demo measures real tokens plus whether each answer is exact against a Python-computed ground truth.

What is a self-improving AI agent?

A self-improving AI agent extends its own toolkit at runtime: it solves a task, writes a small tool for that capability, loads the tool into itself, and reuses it on later tasks instead of re-reasoning from scratch. What improves is the agent's toolkit (the set of functions it can call), not the model's weights. There is no fine-tuning and no training step. The same model runs the whole time; it just accumulates tools it authored, the way a developer accumulates a personal library of helpers.

That distinction matters. "Self-improvement" sounds like the model is getting smarter. It isn't. The deterministic harness around the model is getting richer, and that's where the durable gain lives.

How does meta-tooling work, and why Strands makes it possible

The "writes its own tools" part isn't a homemade trick; it's a documented Strands capability called meta-tooling. Strands ships three tools that let an agent author and hot-load code into itself:

editor writes the tool's .py file.
load_tool hot-loads that file into the agent so it becomes one of its own tools.
shell runs or debugs it if a load fails.

The diagram shows the loop the agent follows for each task: if it already has a tool for this capability it just reuses it (the green path); if not, it uses editor to write a tools/<name>.py file, load_tool to load that file into its own toolkit, shell to debug if needed, and then calls the new tool for an exact, deterministic result.

from strands import Agent
from strands_tools import editor, load_tool, shell

agent = Agent(tools=[editor, load_tool, shell], system_prompt=BUILDER_PROMPT)

# The agent writes ./tools/total_fares.py with an @tool function, loads it, then calls it.
agent("Add a tool named total_fares that sums a list of fares, then use it on [229.92, 360.67, 395.14].")

print(agent.tool_names)   # -> [..., 'total_fares']  the agent extended its own toolkit

For each new task, if the agent already has a tool for that capability it just calls it (a plain tool call, no re-authoring); otherwise it writes and loads a new one. Here is the actual tool the agent wrote for the "total all fares" capability in one run: small, typed, deterministic.

@tool
def total_fares(fares: list[float]) -> float:
    return round(sum(fares), 2)

That's the whole idea. The agent saw it would keep needing this, wrote it once, and from then on the sum is computed by Python, not approximated by a language model.

How do static and self-improving compare?

A measured run on OpenAI gpt-4o-mini gave me this shape (the static agent reads answers with structured_output_model=NumberAnswer, so correctness is a numeric comparison against ground truth, not a regex scrape of free text):

	Static agent	Self-improving agent
How it answers	Re-reasons every task by hand	Writes a tool once, loads it, reuses it
Tasks solved exactly	~2/4	4/4
Answers verifiable	0/4 (no way to check itself)	4/4 (a tool that runs is deterministic)
Model tokens (single pass)	~814	~129,000
Tools built / reused	0 / 0	3 built / 1 reused

Read the token row carefully: the self-improving agent uses far more tokens on this single pass, roughly 158x more (dividing the two figures above). That is not a typo and not the part to gloss over. Authoring tools with editor, load_tool, and shell means writing a file, loading it, and sometimes debugging it, which is genuinely expensive.

Does it use fewer tokens?

No. On a single pass it uses more, a lot more. If you ran each task exactly once and never again, the static agent is cheaper in raw tokens.

The win is not the token bill; it's what happens on repetition and on the hard cases:

Reuse. Once a tool exists, every later call is a plain, exact tool call with no re-reasoning. The static agent re-pays its full reasoning cost on every repeat, and production sends the same kind of work over and over.
Correctness. Summing several real fares with decimals is a genuine weakness for a small model: it approximates and cannot tell it's wrong. That's deterministic work that belongs in code. The self-improving agent writes that code once and is exact from then on, and a tool that runs is verifiable in a way free-text reasoning never is.

So the honest framing is "build once, then run it exactly and forever," not "fewer tokens." Anyone promising that self-improvement shrinks the bill on the first pass is selling the wrong story.

Is it safe to run agent-written code?

The agent writes files and runs code, so the demo sets BYPASS_TOOL_CONSENT=true; otherwise editor, shell, and load_tool would block on an interactive confirmation prompt and hang the notebook. That flag is set knowingly, because this demo runs the agent's own generated math helpers on local data.

For untrusted code in production, don't run it on the host. Strands ships Sandbox and PosixShellSandbox to isolate generated code, and a production runtime such as Amazon Bedrock AgentCore gives each session an isolated runtime plus a versioned tool registry, so the tools an agent earns persist across sessions instead of being re-guessed each time. The thesis holds at every scale: deterministic work belongs in a tool the agent writes once and reuses, not re-derived and re-paid for on every call.

Frequently asked questions

Is this a multi-agent system?
No. It's a single agent improving its own toolkit. There's no swarm and no graph of agents; the "self-improvement" is one agent writing and hot-loading its own tools via meta-tooling.

Does the model get fine-tuned or retrained?
No. The model is untouched. What grows is the agent's set of callable tools. Same weights start to finish; the agent just accumulates functions it authored.

Why does the static agent get answers wrong?
Summing several real fares with decimals is a deterministic task a small model approximates and can't self-check. The self-improving agent moves that work into a tiny Python function, so it's computed exactly instead of guessed.

Do I need OpenAI for this?
No. Strands is model-agnostic: its providers are interchangeable, so the same code runs on Amazon Bedrock (the default), Anthropic, OpenAI, or a local model via Ollama. The demo defaults to OpenAI gpt-4o-mini because it needs only an API key to try, though that's still a cloud API call, not a model on your machine.

Run it yourself

The full before/after (four fare tasks over real Duffel fares, a static agent that re-reasons versus an agent that writes, loads, and reuses its own tools, with real token and correctness numbers) runs end to end in one notebook. Clone the repo and run it:

git clone https://github.com/elizabethfuentes12/resilient-agent-harness-sample-for-aws.git
cd resilient-agent-harness-sample-for-aws/04-self-improving-skills

uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Default: OpenAI gpt-4o-mini (just an API key to try)
echo "OPENAI_API_KEY=sk-..." > .env
echo "DUFFEL_API_KEY=duffel_test_..." >> .env   # free sandbox token from app.duffel.com
uv run test_self_improving_skills.py

Prefer notebooks? Open test_self_improving_skills.ipynb and run it top to bottom.

The pattern follows Memento-Skills (Zhou et al., Mar 2026) and SAGE (Peng et al., Mar 2026), both on agents that improve at inference time with no fine-tuning. The benchmark figures and full reading are in the repo's README. What this demo produces is the real, measured token-and-correctness contrast on your chosen model.

What repeated reasoning is your agent re-paying for on every call, work it could write into a tool once and never re-derive again? Tell me in the comments.

📬 Building reliable AI agents? I write about agent memory, guardrails, evaluation, and multi-agent patterns. Subscribe to my newsletter to get the next one.

Gracias!

🇻🇪 Dev.to Linkedin GitHub Twitter Instagram Youtube