- Book: AI Agents Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You ask the model "what's 17% of $4,820, and who founded Stripe?" It answers $819.28 and "Patrick and John Collison." The number is wrong by twelve cents, and you have no way to know which half it made up. That is the moment most teams stop pretending an LLM call is an agent.
An agent is a loop. The model picks a tool, your code runs it, the result goes back into the conversation, and the model decides whether to call another tool or stop. Eighty lines of Python is enough to build one that works: a calculator, a fake search, an iteration cap, and an error path you can actually debug. No LangChain. No CrewAI. Just the Anthropic SDK and a while loop.
The shape of the loop
Before any code, the contract. You send messages.create with a list of tools. The model returns a response with stop_reason. If the stop reason is end_turn, you're done; read the text and return it. If the stop reason is tool_use, the response contains one or more tool_use blocks with an id, a tool name, and an input dict. You execute each one, build a user message with tool_result blocks keyed to those ids, and call the API again. Loop until the model says it's done or until you hit a cap. The official Anthropic docs spell out the same handshake on the Tool use with Claude page.
Two things kill toy agents:
- No iteration cap. The model gets stuck calling the same tool, or two tools ping-pong forever, and your bill quietly hits triple digits.
- No error path. A tool raises, you crash the loop, and the model never gets to recover. Real agents tell the model the tool failed and let it try something else.
The two tools
Pick boring tools so the loop is the interesting part. A calculator that evaluates a math expression, and a fake search that returns canned snippets. Both take a single string and return a string.
import ast
import operator as op

_OPS = {
    ast.Add: op.add, ast.Sub: op.sub,
    ast.Mult: op.mul, ast.Div: op.truediv,
    ast.Pow: op.pow, ast.USub: op.neg,
}

def _eval(node):
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp):
        return _OPS[type(node.op)](_eval(node.left),
                                   _eval(node.right))
    if isinstance(node, ast.UnaryOp):
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")

def calculator(expression: str) -> str:
    tree = ast.parse(expression, mode="eval").body
    return str(_eval(tree))
eval() would be three lines. It would also let the model run arbitrary Python in your process the moment a prompt injection slips through. The AST walker above handles + - * / ** () and refuses everything else. That's the bar for a tool the model can call from untrusted input.
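Here's that bar in practice, doctest-style: a normal expression evaluates, an injection attempt hits the ValueError instead of your shell.

>>> calculator("(17 * 4820) / 100")
'819.4'
>>> calculator("__import__('os').system('rm -rf /')")
Traceback (most recent call last):
  ...
ValueError: unsupported expression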
Now the fake search. In a real system this is a vector store, a Bing call, your internal API. Here it's a dictionary so the post stays runnable offline.
_FAKE_INDEX = {
    "stripe": "Stripe was founded in 2010 by Patrick "
              "and John Collison in Palo Alto.",
    "anthropic": "Anthropic was founded in 2021 by "
                 "Dario and Daniela Amodei.",
    "python": "Python was first released in 1991, "
              "created by Guido van Rossum.",
}

def search(query: str) -> str:
    q = query.lower()
    hits = [v for k, v in _FAKE_INDEX.items() if k in q]
    return "\n".join(hits) if hits else "No results."
Two pure functions: the calculator raises on anything it doesn't recognize, and the search degrades to "No results."
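A quick doctest-style check of the search against the canned index:

>>> search("Who founded Stripe?")
'Stripe was founded in 2010 by Patrick and John Collison in Palo Alto.'
>>> search("weather in Berlin")
'No results.'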
Telling Claude what tools exist
Each tool needs a JSON schema description. The model reads the description to decide when to call it, so write it like a docstring an intern would read.
CALCULATOR_TOOL = {
    "name": "calculator",
    "description": (
        "Evaluate an arithmetic expression. "
        "Supports + - * / ** and parentheses. "
        "Use for any numeric computation."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "expression": {"type": "string"},
        },
        "required": ["expression"],
    },
}
The description matters more than the name. The model reads it during tool selection and uses the wording to decide whether the user's question maps to this tool. "Use for any numeric computation" is direct instruction, not a comment for humans.
SEARCH_TOOL = {
    "name": "search",
    "description": (
        "Look up a short factual snippet by "
        "keyword. Use for company founders, "
        "release dates, and similar facts."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
        },
        "required": ["query"],
    },
}

TOOLS = [CALCULATOR_TOOL, SEARCH_TOOL]
DISPATCH = {"calculator": calculator, "search": search}
The DISPATCH dict is the routing table. When the model emits a tool_use block with name="calculator", you look it up here and call it with the model's input. Keeping routing in one dict makes adding a third tool a one-liner. More on that at the end.
The agent loop
The full loop:
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"
MAX_STEPS = 6

def run_agent(user_prompt: str) -> str:
    messages = [{"role": "user",
                 "content": user_prompt}]
    for step in range(MAX_STEPS):
        resp = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant",
                         "content": resp.content})
Each iteration starts the same way: call the API with the current message history, then append the assistant's response so the next turn sees what the model just said. The branch on stop_reason decides what happens next.
        if resp.stop_reason == "end_turn":
            return _final_text(resp)
        if resp.stop_reason != "tool_use":
            raise RuntimeError(
                f"unexpected stop: {resp.stop_reason}"
            )

        tool_results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            tool_results.append(_run_tool(block))

        messages.append({"role": "user",
                         "content": tool_results})

    raise RuntimeError(
        "hit MAX_STEPS without final answer"
    )
The loop is six API calls maximum. Tunable per workload. Forty calls is a runaway agent; six is a working one for tasks like "compute this and look up that." The cap is non-negotiable. Without it, a confused model will keep calling tools until you notice on the bill.
Two helpers do the work the loop punted on:
def _final_text(resp) -> str:
    parts = [b.text for b in resp.content
             if b.type == "text"]
    return "\n".join(parts).strip()

def _run_tool(block) -> dict:
    fn = DISPATCH.get(block.name)
    if fn is None:
        return {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": f"unknown tool: {block.name}",
            "is_error": True,
        }
    try:
        out = fn(**block.input)
    except Exception as e:
        return {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": f"{type(e).__name__}: {e}",
            "is_error": True,
        }
    return {
        "type": "tool_result",
        "tool_use_id": block.id,
        "content": str(out),
    }
is_error: True is the line every tutorial skips. When you set it, the model sees the failure as a tool error and decides what to do next: retry with different input, switch tools, or give up cleanly. Without it, an exception in your tool either crashes the loop or returns a string the model treats as a successful answer. Both end badly. The behavior is documented under tool result error handling.
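Concretely, when the calculator raises, the next user message the loop sends back looks like this (the tool_use_id echoes whatever id the model put on its tool_use block; the value here is a placeholder):

{
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": "toolu_01...",  # placeholder id
        "content": "ZeroDivisionError: division by zero",
        "is_error": True,
    }],
}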
Running it
if __name__ == "__main__":
    q = ("What's 17% of 4820, and who founded "
         "Stripe? Show me the math.")
    print(run_agent(q))
On a working API key, you get something like:
17% of 4820 is 819.40, computed as 4820 * 0.17.
Stripe was founded in 2010 by Patrick and John
Collison in Palo Alto.
Two tool calls, one final assistant turn, and at most three API round-trips; if the model batches both tool calls into a single turn, it's two. The model picked calculator for the math, search for the founders, and composed the answer from both results. If you swap the prompt to "what's 5 divided by 0", the calculator raises ZeroDivisionError, the loop sends is_error: True back, and the model replies with a graceful "that's undefined" instead of crashing.
What breaks first when you scale this
The 80-line agent is honest at the scale it was built for: one user per process, simple tools. Three things break the moment you push it.
Tool latency dominates. A real search tool is 200–800ms. Six-step agents become five-second agents fast. You want tools to run in parallel when the model emits multiple tool_use blocks in one turn. Replace the for block in resp.content loop with asyncio.gather over the tool calls. The model already emits parallel tool calls for independent lookups; your code has to stop serializing them.
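A sketch of that swap, assuming the synchronous _run_tool from earlier stays as-is; asyncio.to_thread pushes each blocking call onto a worker thread so independent calls overlap:

import asyncio

async def _run_tools_parallel(blocks) -> list:
    # One task per tool_use block; gather preserves order,
    # so results still pair up with the model's calls.
    tasks = [asyncio.to_thread(_run_tool, b)
             for b in blocks if b.type == "tool_use"]
    return list(await asyncio.gather(*tasks))

If the rest of the loop stays synchronous, the dispatch step becomes tool_results = asyncio.run(_run_tools_parallel(resp.content)).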
The conversation grows on every step, so the total tokens you send across a run grow quadratically. Every step appends both an assistant message (with all tool_use blocks) and a user message (with all tool_result blocks), and every call resends the whole history. After ten steps on a real workload, you're sending 8k tokens of tool history on every call. Cache the static prefix with prompt caching, or summarize old turns once they're behind a checkpoint.
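A sketch, assuming Anthropic's prompt caching semantics, where a cache_control breakpoint on the last tool definition caches the whole tools block across calls:

TOOLS = [
    CALCULATOR_TOOL,
    # Breakpoint on the last tool caches all tool definitions.
    {**SEARCH_TOOL, "cache_control": {"type": "ephemeral"}},
]

Two tiny tools sit under the minimum cacheable prefix (1024 tokens on Sonnet models), so this only starts paying once your tool set and system prompt grow.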
Tool errors are silent in production. The model recovers from is_error: True so smoothly that you never notice your search tool has been timing out for two weeks. Log every tool call with its inputs, latency, and error flag. An agent without per-tool observability is a black box; the loop hides bugs the same way it hides successes.
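A thin wrapper over _run_tool covers the basics; the logger name and field layout are illustrative, swap them for whatever your stack expects:

import logging
import time

log = logging.getLogger("agent.tools")

def _run_tool_logged(block) -> dict:
    # Same contract as _run_tool, plus one log line per call.
    start = time.perf_counter()
    result = _run_tool(block)
    log.info("tool=%s input=%s latency_ms=%.0f error=%s",
             block.name, block.input,
             (time.perf_counter() - start) * 1000,
             result.get("is_error", False))
    return result

Point the loop at _run_tool_logged instead of _run_tool and every call shows up in your logs.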
Adding a third tool
Write the function, add an entry to TOOLS with the JSON schema, and add a line to DISPATCH. A get_weather(city) tool that hits a real API is twelve more lines and one extra row in the routing table. The loop doesn't change.
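A sketch of that shape; the endpoint and response fields are stand-ins for whichever weather API you actually use:

import json
import urllib.parse
import urllib.request

def get_weather(city: str) -> str:
    # Hypothetical endpoint; replace with a real weather API.
    url = ("https://api.example.com/weather?city="
           + urllib.parse.quote(city))
    with urllib.request.urlopen(url, timeout=5) as r:
        data = json.load(r)
    return f"{data['temp_c']}°C, {data['conditions']}"

WEATHER_TOOL = {
    "name": "get_weather",
    "description": ("Current weather for a city. "
                    "Use for any weather question."),
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

TOOLS.append(WEATHER_TOOL)
DISPATCH["get_weather"] = get_weather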
If this was useful
The loop here is the smallest honest version of the agent pattern. The AI Agents Pocket Guide covers the parts that show up next: parallel tool dispatch, multi-agent handoff, memory and state across runs, and the failure modes you only meet in production. If you're past "make it work once" and onto "make it work every time," it's the book for that gap.
