I built an AI agent in 50 lines of Python. Here’s what everyone gets wrong about them.

You’ve been calling things agents for months. Most of them are just chatbots with extra steps.

Every senior developer I know has done this at least once: you’re mid-review, someone asks how the agent loop actually works under the hood, and your brain quietly blue-screens. You’ve been using Claude, Cursor, Copilot, and Junie every single day. You’ve shipped features on top of them. You’ve nodded confidently in sprint planning while someone called the autocomplete a “multi-agent orchestration layer.” And yet, if someone put a gun to your head and asked you to draw the architecture, the actual mechanics of what makes something an agent instead of a really fast autocomplete, you’d stall.

That was me about six weeks ago.

I got fed up with not knowing, so I did what I always do when a concept won’t click: I built the stupidest possible version of it from scratch. No LangChain. No LangGraph. No CrewAI. No framework scaffolding to hide behind. Just Python, an API key, and a while loop. Fifty lines. And somewhere around line thirty-two, the thing that’s been fuzzy for a year finally came into focus.

Turns out an agent isn’t magic. It’s not even particularly clever. It’s a deterministic loop with a short memory and a decision point. The model doesn’t “think.” It reads a conversation history, decides whether it has enough to answer or needs to call a tool, and either exits or loops. That’s it. Everything else (the retry logic, the subagents, the memory persistence, the human-in-the-loop approvals) is infrastructure layered on top of that core loop.

TL;DR: This article walks through building a real AI agent from first principles: a cloud API brain, then local models via Ollama, then mixed-mode orchestration, then MCP tool sharing, then straight talk about where frameworks actually earn their keep. By the end, you’ll be able to explain exactly what’s happening when Claude Code spins up a subagent, and you’ll have the code to prove you understand it.

What an agent actually is (and why most “agents” aren’t)

Here’s the real definition, stripped of the marketing: a regular LLM call is a one-shot operation. You send a prompt, you get a response, done. The model has no memory of what just happened. It doesn’t know if it got it right. It doesn’t retry. It just answers and exits.

An agent is different because it loops. It takes a high-level task, reasons about what to do next, takes an action, observes the result, and keeps going until it decides it’s finished. That repeating cycle (think, act, observe, decide) is the thing that turns a language model into something that behaves like an autonomous system. Remove the loop and you just have a chatbot that called a function once.

Most things being marketed as “agents” right now are closer to the chatbot end of that spectrum. A RAG pipeline that retrieves documents and summarizes them isn’t an agent. A Slack bot that calls an API when you type a command isn’t an agent. They’re useful tools, but they don’t loop, they don’t observe results and adjust, and they don’t decide when to stop. Calling them agents is like calling a calculator a mathematician because it can add numbers.

The pattern that most real agents follow is called ReAct (Reasoning and Acting), introduced by Yao et al. in a 2022 paper that’s worth at least skimming. The idea is simple: the model doesn’t jump straight to a final answer. It produces a thought about what to do, then an action (a tool call), then waits to observe the result before deciding what to do next. The loop continues until the model has enough context to answer directly without calling any more tools.

Think of it like a developer assigned a Jira ticket. They don’t read it once and immediately output the solution. They read it, try something, hit an error, read the error, Google something, try again, check if the tests pass, then mark it done. The agent does the same thing; it’s just that the conversation history is the Jira thread, the tools are the shell commands, and the LLM is the developer who never sleeps and never complains about the sprint velocity.

Here’s what every single cycle of the loop looks like under the hood:

  1. Send the current conversation to the LLM: system prompt, user message, any prior tool results
  2. The LLM returns either a final answer or a list of tool calls it wants to make
  3. If it’s a final answer, you’re done
  4. If it’s tool calls, execute them, append the results to the conversation, go back to step 1

That’s the entire architecture. The model has no consciousness and no genuine self-reflection; what it has is the full conversation history sitting in its context window, and a system prompt telling it what tools exist and when to stop. The ReAct pattern turns that into something that looks like self-correction. And it works surprisingly well.

The part people consistently underestimate is the system prompt. It’s not a formality. It’s the steering wheel. It tells the model when to use tools, when a task is complete, what a final answer should look like, and what it should never do. When leaked system prompts from production agents at Anthropic and Apple showed up online, the striking thing wasn’t the models’ sophistication; it was how long and careful the system prompts were. Hundreds of lines of plain English, constraining behavior with precision. The loop is simple. The steering is where the craft lives.
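
To make that concrete, here’s the flavor of steering a serious system prompt adds. This is a hypothetical sketch, not one of the leaked prompts; the point is the explicit stop conditions and prohibitions:

You are a coding agent. Rules:
- Use the provided tools to gather facts. Never guess file contents.
- Run at most one shell command per turn and read its output before the next.
- When the task is complete, reply with a summary and make no tool calls.
- Never delete files, push to remote branches, or send network requests
  unless the user has explicitly approved that exact action in this session.

Production versions run to hundreds of lines, but every line is doing the same job: telling the loop when to act, when to stop, and what is off-limits.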

Points:

  • Agent = a loop with a decision point, not a single LLM call with extra marketing
  • ReAct (Reason + Act) is the foundational pattern: think, act, observe, repeat until done
  • The model “self-corrects” because the conversation history accumulates, not because it’s actually reflecting
  • The system prompt is the real logic layer; underestimate it and your agent does whatever it wants

“The model doesn’t think. It reads a very long chat history and decides whether to answer or loop. That’s it.”

The 50-line implementation: cloud brain, zero framework

Let’s build the thing. The core of every agent ever written is a function that looks roughly like this:

import json

from openai import OpenAI


def run_agent(task: str, client: OpenAI, model: str = "gpt-4o-mini") -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. Use tools when needed. "
                "When you have a final answer, respond without calling any tools."
            ),
        },
        {"role": "user", "content": task},
    ]

    while True:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
        )

        message = response.choices[0].message
        messages.append(message)

        # No tool calls means the model has its answer: exit the loop
        if not message.tool_calls:
            return message.content

        # Otherwise run each requested tool and feed the result back in
        for tool_call in message.tool_calls:
            name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)
            fn = TOOL_FUNCTIONS.get(name)
            result = fn(**args) if fn else f"Unknown tool: {name}"
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

That’s the whole architecture. Everything else (LangGraph, CrewAI, AutoGen, the Claude Agents SDK) is infrastructure layered on top of that pattern. Strip any of them down far enough and you’ll find a while loop, a messages list, and an if not tool_calls check hiding somewhere inside.

The critical line is if not message.tool_calls. If the model returns text without requesting any tools, it's signaling it has everything it needs to answer. The agent exits and returns that response. If the model requests tools, the agent executes them, appends the results to messages, and sends the whole conversation back for another round. The messages list is the agent's short-term memory: every tool call and every result gets appended to it, so by the time the LLM decides it's done, it has seen everything it did and everything it learned from doing it.
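
To see that memory model in data, here’s roughly what messages looks like by the end of a short run. The values are illustrative, and in the code above the assistant entries are SDK message objects rather than plain dicts:

messages = [
    {"role": "system", "content": "You are a helpful assistant. ..."},
    {"role": "user", "content": "What is 15% of 847?"},
    # Turn 1: the model asked for a tool instead of answering
    {"role": "assistant", "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "calculate", "arguments": '{"expression": "847 * 0.15"}'}},
    ]},
    # The result the loop appended before going around again
    {"role": "tool", "tool_call_id": "call_1", "content": "127.05"},
    # Turn 2: no tool_calls, so if not message.tool_calls fires and the loop exits
    {"role": "assistant", "content": "15% of 847 is 127.05."},
]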

To make this concrete, three simple tools: current date/time, a calculator, and a weather stub you’d replace with a real API call in production.

from datetime import datetime


def get_current_date() -> str:
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")


def calculate(expression: str) -> str:
    # Empty builtins limits what eval can reach: fine for a demo,
    # not a real security boundary
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"


def get_weather(city: str) -> str:
    return f"Weather in {city}: 72°F, partly cloudy"

Each tool also needs a JSON schema that tells the LLM what’s available, what arguments it takes, and what they mean. This is what the model actually reads when it decides which tool to call. Sloppy descriptions here = wrong tool calls later. The schema is documentation the model acts on, not just metadata you write once and forget.
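
Here’s what that looks like for the calculator, in the standard OpenAI tool format, along with the dispatch table the loop uses. The names match the functions above; the description wording is mine:

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a basic arithmetic expression, e.g. '847 * 0.15'.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "A Python arithmetic expression to evaluate.",
                    },
                },
                "required": ["expression"],
            },
        },
    },
    # ...same shape for get_current_date and get_weather
]

# Maps the name the model emits to the function the loop actually calls
TOOL_FUNCTIONS = {
    "get_current_date": get_current_date,
    "calculate": calculate,
    "get_weather": get_weather,
}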

Run it against a task that needs all three tools at once:

Task: What's today's date? Also, what is 15% of 847? And what's the weather in Tokyo?

> calling get_current_date({})
> calling calculate({'expression': '847 * 0.15'})
> calling get_weather({'city': 'Tokyo'})

Answer: Today is 2026-05-11 09:14:22. 15% of 847 is 127.05.
The weather in Tokyo is 72°F and partly cloudy.

On the first turn the LLM identified all three tools, called them, got the results, and assembled the final answer. No orchestration layer. No dependency to install. The model just read the tool schemas, figured out it needed all three, requested them in a single turn, and exited cleanly.

What you’re looking at is a complete mental model for every production agent you’ll ever use. Claude Code starting a subagent is this loop calling another loop. Cursor retrying a failed file write is this loop with error handling bolted on. GitHub Copilot Workspace planning a multi-step refactor is this loop running longer with a more complex system prompt. The shape is always the same.

One thing worth calling out: I used OpenAI here because it has the cleanest tool-calling interface for a tutorial, but this works with any OpenAI-compatible API; Anthropic, Gemini, and most local inference servers all support the same pattern. The agent code has no opinion about which model is on the other end, as long as it speaks the protocol.

The full working version with all tool schemas is in sergenes/mini_agent on GitHub if you want to run it directly. Worth pulling down and stepping through with a debugger once; watching the messages list grow in real time makes the memory model click in a way that reading about it doesn’t.

The ReAct loop: the actual architecture underneath every agent you’ve used

Points:

  • The agent is a while loop, a messages list, and one if-statement; everything else is scaffolding
  • messages is the memory: every tool result gets appended before the next LLM call
  • Tool schemas are documentation the model acts on; write them carefully
  • Every production agent you’ve used is this pattern with error handling, retry logic, and a longer system prompt

“LangGraph, CrewAI, AutoGen: strip any of them down far enough and you’ll find a while loop and an if not tool_calls check hiding inside.”

The local model trap: why Mistral 7B silently broke everything

Ollama exposes an OpenAI-compatible API, which means the agent code runs on a local model with exactly one change:

ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama ignores the key, but the SDK requires a non-empty one
)

answer = run_agent(task, ollama_client, model="qwen2.5")

That’s it. The agent has no idea whether it’s talking to OpenAI’s servers or a model running on your laptop. Same loop, same tool schemas, same exit condition. To get Ollama running: install from ollama.com, then ollama pull qwen2.5 and ollama serve. After that, the whole thing runs offline. No API costs, no data leaving your machine, no rate limits at two in the morning when you're debugging something you broke before bed.

Except here’s what actually happened when I tried it the first time.

I pulled Mistral 7B. Widely recommended, solid benchmark numbers, comes up in every “best local models” thread on Hacker News. Ran the same three-tool task. No errors. Clean output. And then I read it:

Answer: I need to call get_current_date() to find today's date.
Let me use the calculate tool: calculate(expression="847 * 0.15")...
The weather in Tokyo is probably warm this time of year.

Plain text. Describing tool calls in prose. Guessing the weather. message.tool_calls was empty, so the agent hit the if not message.tool_calls check, found nothing, and exited on the first turn with whatever the model had written.

My first instinct was that I’d broken the agent code. Spent a solid chunk of time staring at the while loop, checking the tool schemas, wondering if Ollama’s API response format was slightly different. It wasn’t. The code worked exactly as written. The problem was the model.

Mistral 7B doesn’t support OpenAI-style structured function calling. It was trained to describe actions in prose, not emit them as structured JSON. When it saw the tool schemas in the request, it understood conceptually what tools were available so it started narrating what it would call. But it never actually emitted the structured tool_calls object the agent was waiting for. The model hallucinated the syntax it thought I expected, and the agent politely returned that hallucination as a final answer.

This is the trap: the code gives you no error. No exception. No warning. The agent just exits on the first turn and you get back a paragraph of the model describing what it wishes it could do. If you’re not watching closely, you might not even notice the tools never fired.

The fix is model selection. Not all local models support structured function calling through Ollama, and the ones that don’t fail silently in exactly this way.
Here’s what worked and what didn’t:

qwen2.5        ✓  Strong tool calling, good reasoning
llama3.1       ✓  Solid across most tool schemas
mistral-nemo   ✓  Works, occasionally verbose in reasoning
phi4           ✓  Surprisingly capable for its size
mistral 7B     ✗  Prose descriptions, no structured calls
gemma2         ✗  Inconsistent: sometimes works, often doesn't

If your agent returns immediately without calling any tools, suspect the model before the code. Swap to qwen2.5 and see if the behavior changes. Nine times out of ten that's it.

There’s a deeper lesson here about the gap between benchmark performance and production behavior. Mistral 7B scores well on reasoning benchmarks. It generates coherent, useful text. But “coherent text” and “emits structured JSON tool calls” are different capabilities, and the benchmarks that get cited in model announcements don’t always test the latter. Before you commit to a local model for anything agentic, run the simplest possible tool-calling task first. Date, calculator, weather stub. If those three work, you’re probably fine. If the model narrates them instead of calling them, move on.
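
That test is small enough to automate. A minimal probe, reusing the client and TOOLS from above (one turn, one question; a capable model requests get_current_date, a prose-only model answers in text):

def supports_tool_calls(client: OpenAI, model: str) -> bool:
    """One-turn probe: does this model emit structured tool calls at all?"""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "What's today's date?"}],
        tools=TOOLS,
        tool_choice="auto",
    )
    return bool(response.choices[0].message.tool_calls)

print(supports_tool_calls(ollama_client, "qwen2.5"))  # expect True
print(supports_tool_calls(ollama_client, "mistral"))  # likely False

It’s a probe, not a guarantee; some models call tools intermittently, so run it a few times before trusting the result.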

The mixed-mode pattern handles this more elegantly: run the loop locally, and delegate to a cloud model only when the task genuinely needs it.
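
The shape is simple enough to sketch here: the cloud model becomes just another tool in the local agent’s toolbox. This is a hypothetical illustration built from the pieces above (the ask_cloud name and its schema entry are mine, not from the repo):

cloud_client = OpenAI()  # real API key, used only when the local model delegates

def ask_cloud(question: str) -> str:
    """Escalate a hard sub-question to a stronger cloud model."""
    response = cloud_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Register it like any other tool (a matching schema entry goes in TOOLS),
# then run the loop on the local model; it decides when to escalate
TOOL_FUNCTIONS["ask_cloud"] = ask_cloud
answer = run_agent(task, ollama_client, model="qwen2.5")

The loop doesn’t change at all; routing between local and cloud is just tool selection.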

Points:

  • Ollama’s OpenAI-compatible API means zero code changes to run locally: one base_url swap
  • Not all local models support structured function calling; silent failure is the default behavior
  • Mistral 7B describes tool calls in prose instead of emitting JSON; the agent exits cleanly on turn one
  • Test with a simple three-tool task before committing to any local model for agentic work
  • qwen2.5 and llama3.1 are the reliable starting points as of mid-2026

“The code gives you no error. No exception. No warning. The agent just exits and returns a paragraph of the model describing what it wishes it could do.”

MCP: the protocol that finally fixes hardcoded tool hell

Here’s the problem nobody talks about when they show you the 50-line agent: every tool is hardcoded into the script. get_current_date, calculate, get_weather: all defined in the same file, all manually added to the TOOLS list, all maintained by you. If you want another agent to use the same tools, you copy and paste. If you want to use tools someone else built, you rewrite their implementation into your format. If you update a tool, you update every script that hardcoded it.

This is fine for a tutorial. It becomes genuinely painful around the time you have three agents across two projects and you’re maintaining six copies of the same file read utility with slightly different signatures.

MCP (Model Context Protocol) is the standard Anthropic shipped in November 2024 specifically to fix this. The spec lives at modelcontextprotocol.io and the idea is straightforward: instead of hardcoding tool definitions into your agent, you point the agent at a server. The server advertises what tools it has. The agent discovers them, gets their schemas, and calls them exactly the same way it calls local functions. The agent doesn’t care whether a tool is a Python function in the same file or a service running on the other side of the internet; as long as it speaks MCP, it works.

Think USB-C. Before USB-C, every device had its own connector and you maintained a drawer full of incompatible cables. MCP is the USB-C for AI tools: one protocol, anything plugs in. GitHub has an MCP server. Slack has one. Postgres has one. Google Drive has one. You point your agent at any of them and the tools just appear, already described, ready to use. Someone else writes the implementation once; everyone else benefits.

Building an MCP server is almost comically simple with FastMCP:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("mini-tools")


@mcp.tool()
def to_uppercase(text: str) -> str:
    """Convert text to uppercase."""
    return text.upper()


@mcp.tool()
def count_words(text: str) -> int:
    """Count the number of words in a string."""
    return len(text.split())


if __name__ == "__main__":
    mcp.run()

That’s a complete, working MCP server. Ten lines. The decorator handles schema generation from the type hints and docstring; no manual JSON. Run it as a subprocess and any MCP-compatible client can discover those tools via a JSON-RPC handshake. Claude Desktop, Cursor, your DIY agent, anyone. Publish the server, point a config at it, and it just works.

From the agent’s side, the MCP client starts the server as a subprocess and calls tools via JSON-RPC. The agent receives the tool list exactly the same way it would receive hardcoded definitions they show up in TOOLS, get passed to the LLM, get called the same way, return results the same way. The companion repo at github.com/sergenes/mini_agent includes a working mcp_client.py that demonstrates the full handshake if you want to see the plumbing.
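
If you want the minimal shape before opening the repo, here’s roughly what that client side looks like with the official mcp Python SDK. A sketch against the server above, saved as server.py (a filename I’m assuming); the exact API surface can shift between SDK versions:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the FastMCP server from the previous listing as a subprocess
    params = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discovery: the server advertises its tools and their schemas
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            # Calling: same JSON-RPC channel, by name, with arguments
            result = await session.call_tool("count_words", {"text": "hello agent world"})
            print(result.content)

asyncio.run(main())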

The architectural shift here is bigger than it sounds. Before MCP, every team building agents was maintaining their own tool library, reinventing the same integrations (“call the GitHub API”, “query Postgres”, “read a Slack channel”) across dozens of repos with incompatible interfaces. MCP turns those into shared infrastructure. One well-maintained server replaces hundreds of slightly-wrong copies.

It also changes what “building an agent” means in practice. The interesting work stops being “how do I wire up this API” and starts being “which servers do I need and how do I write a system prompt that uses them well.” The loop stays exactly the same. What changes is that the tool ecosystem is now the internet instead of whatever you’ve personally had time to implement.

The FastMCP library is worth bookmarking; it’s the fastest way to expose your own tools to any agent in the ecosystem, and the decorator pattern makes the schema definition essentially free. Write the function, add the decorator, document the arguments, done.

Points:

  • Hardcoded tools don’t scale: you end up maintaining copies across every project and agent
  • MCP is the standard protocol for tool discovery and calling: agent points at server, server advertises tools, agent uses them
  • Building an MCP server with FastMCP is ten lines; the decorator handles schema generation from type hints
  • The ecosystem already has MCP servers for GitHub, Slack, Postgres, Google Drive, and hundreds more
  • MCP shifts the interesting work from “wiring APIs” to “writing system prompts that use shared tools well”

“Before MCP I was copy-pasting tool definitions across three different repos. Not great. Not sustainable. Definitely the kind of thing you notice at code review.”

Frameworks aren’t magic; they’re load-bearing walls

Let’s be honest about what the 50-line agent is missing, because pretending it’s production-ready would be doing you a disservice.

No error handling: if a tool throws an exception, the agent crashes. No retry logic: a flaky API call takes down the whole run. No way to pause for human approval before the agent does something destructive, like deleting files or sending emails. No memory beyond the current conversation: restart the process and it has no idea what it did ten minutes ago. No ability to spawn parallel subagents when a task is large enough to benefit from splitting. No context compression when the messages list grows long enough to hit the model’s limit. No observability into what the agent decided at each step and why.

That list is not a criticism of the 50-line version. It’s a description of what frameworks exist to solve.
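
For a sense of what the first two cost to fix by hand, here’s a sketch of wrapping the tool-execution step with error capture and retries. It’s my own variation on the loop above, not code from the repo:

import time

def execute_tool(name: str, args: dict, retries: int = 3) -> str:
    """Run a tool, feeding failures back to the model instead of crashing."""
    fn = TOOL_FUNCTIONS.get(name)
    if fn is None:
        return f"Unknown tool: {name}"
    for attempt in range(1, retries + 1):
        try:
            return fn(**args)
        except Exception as e:
            if attempt == retries:
                # Surface the failure as a tool result so the model can adapt
                return f"Tool {name} failed after {retries} attempts: {e}"
            time.sleep(2 ** attempt)  # simple exponential backoff

The design choice worth noticing: a failed tool returns an error string instead of raising, so the failure lands in messages and the model gets a chance to try something else. That one decision is a large part of what “self-correction” actually is.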

LangGraph models the agent as a state machine with explicit nodes and edges. You define what happens at each step and what conditions trigger the next one. More setup upfront, but you get checkpointing: the agent can pause mid-run and resume from exactly where it stopped. You get structured error handling at each node. You get human-in-the-loop steps where execution blocks until a human approves the next action. You get full observability into the graph traversal. If you’re building something where a failed tool call means a corrupted database or a mistaken API call means actual money spent, LangGraph is where you want to be. The docs are dense but the core concept clicks fast once you’ve built the naive loop yourself.

CrewAI and AutoGen go a layer higher: multi-agent coordination. Instead of one agent with many tools, you define multiple agents with specialized roles: a researcher, a writer, a critic. Each has its own system prompt, its own tool access, its own model if you want. The orchestrator decides who talks to whom and in what order. Useful for complex tasks where different phases genuinely need different prompts: a research phase that needs web search and summarization, followed by a writing phase that needs a totally different voice, followed by a review phase that needs to be adversarial. Trying to do all of that with one system prompt and one tool set gets messy fast. CrewAI and AutoGen are the frameworks worth reaching for when you hit that wall, though the primitive underneath is still the loop, as the sketch below shows.
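
Even multi-agent coordination reduces to the loop we already have: a subagent is just run_agent exposed as a tool to another run_agent. A hypothetical sketch (the researcher brief is illustrative, not from any framework):

def run_researcher(question: str) -> str:
    """A subagent: the same loop, a different brief, its own fresh context."""
    brief = (
        "You are a research specialist. Gather facts with your tools, "
        f"then answer concisely: {question}"
    )
    return run_agent(brief, client)

# The parent agent sees the subagent as just another tool
TOOL_FUNCTIONS["run_researcher"] = run_researcher

Because run_agent builds a fresh messages list on every call, the subagent gets an isolated context window, which is essentially what Claude Code does when it spins one up.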

The managed runtimes (Claude Agents SDK and OpenAI Assistants API) trade control for speed. You hand off state management, tool routing, threading, and context compression to the platform. Less visibility into what’s happening, but dramatically less code to write and maintain. Worth it when you need to ship something in a week and the task doesn’t require custom orchestration logic. The Claude Agents SDK docs in particular are worth reading even if you don’t use the SDK; the mental model they describe maps directly onto everything we’ve covered here.

Claude Code is the honest comparison point for the DIY version. It does things the 50-line agent can’t touch: starts subagents with isolated context windows when a task is too large for one loop, prompts for confirmation before running destructive shell commands, maintains persistent memory across sessions, retries failed tool calls with adjusted parameters, compresses prior messages when approaching context limits. My agent has six tools, one messages list, and no safety net. Claude Code bills per message. My agent costs nothing until I tell it to hit GPT-4.

If I need to ship something reliable and I don’t want to think about orchestration, I use Claude Code. If I’m prototyping something I’d spend a week fighting a framework to implement, I start from the loop. That tension is actually healthy. Knowing what frameworks abstract means you can make the decision deliberately instead of defaulting to LangGraph because it’s what the tutorial used.

The 50-line version is a sketch. LangGraph is that sketch turned into a building with proper load-bearing walls. You need to understand the sketch before you can reason about the building. Now you do.

Points:

  • The naive loop is missing error handling, retries, human-in-the-loop, persistent memory, parallel subagents, and context compression; frameworks exist to solve exactly these problems
  • LangGraph: state machine model, checkpointing, structured error handling, best for production reliability
  • CrewAI / AutoGen: multi-agent coordination, specialized roles, best when different phases need genuinely different prompts
  • Managed runtimes (Claude Agents SDK, OpenAI Assistants): fastest to ship, least control, worth it for tight deadlines
  • Use frameworks when you’ve hit the problem they solve, not before

“The 50-line loop is a sketch. LangGraph is that sketch turned into a building with proper load-bearing walls. Understand the sketch first.”

What building this actually taught me

I started this because I couldn’t answer a question in a code review. I ended up with something more useful than the answer: a complete mental model I couldn’t get from using production tools, no matter how much I used them.

That’s the thing about abstractions. They’re designed to hide complexity, and they’re good at it. Cursor doesn’t show you the while loop. Claude Code doesn’t surface the messages list growing with each tool call. LangGraph definitely doesn’t make you think about the if not tool_calls check at the center of everything. The abstraction works, which means you can ship fast without understanding the foundation. Until you can't. Until something breaks in a way the framework doesn't have a handler for, or you need behavior the abstraction actively fights against, or someone asks you in a review to explain what's actually happening under the hood.

Building the naive version closed that gap for me. I can see now exactly where an agent gets stuck when the system prompt is underspecified and the model doesn’t know when to stop. I can see why it picks one tool over another: because one schema description was more precise. And I can see when adding more tools actually makes things worse, because the model starts hallucinating calls to tools that don’t quite fit the task instead of admitting it can’t do it. That visibility is worth more than any framework feature.

Some of my projects will use LangGraph or the Claude Agents SDK going forward. Those frameworks solve real problems I genuinely don’t want to reimplement: retry logic, checkpointing, human-in-the-loop approvals. But some will start from this 50-line loop, because I know exactly what it does and I can modify it without fighting abstractions I don’t fully understand. That optionality is the real output of the exercise.

Here’s the slightly uncomfortable opinion to end on: too many developers are reaching for agent frameworks before they understand what’s being abstracted. The ecosystem moved fast: LangChain appeared, then LangGraph, then CrewAI, then a dozen managed runtimes, all within about eighteen months. The tutorials went straight to the framework. A lot of people never saw the loop. That’s going to become a problem as agents get more embedded in production systems and the failure modes get more consequential. You can’t debug what you don’t understand, and “the framework handled it” stops being a satisfying answer when the thing the framework handled was your customer’s data.

Build the naive version first. Then decide what infrastructure you actually need. The loop is twenty minutes of work. The clarity it gives you is permanent.

What did you build to understand something you were already using? Drop it in the comments; I read every one.
