Venkata Manideep Patibandla

I Watched an AI File a Bug Report, Fix the Code, and Run the Tests. I Didn't Touch the Keyboard.

I want to tell you about a moment that genuinely shifted how I think about software.
I gave an AI agent one instruction. One sentence. And then I watched it think.
It read my project files. Then it read more files — imports, type definitions, the test suite. It made a plan. It wrote code. It ran the tests. Two of them failed. It read the error messages, understood why they failed, revised the code, and ran the tests again. They passed. It opened a summary of everything it had done.
I had typed eleven words.
That's not autocomplete. That's not a fancy search engine. That's something that didn't have a good name until recently. We're calling it an agent, and understanding what's actually happening under the hood changes how you build with AI entirely.

The moment the definition clicked for me

Most people encounter AI as a question-answer machine. You ask, it answers, done. One turn. Stateless.
Agents are different. They operate in a loop:

Observe — read the current state of the world (files, error messages, API responses, whatever)
Think — decide what to do next
Act — call a tool, write a file, run a command
Repeat — feed the result back in and go again

That's it. That's the whole thing. Observe → Think → Act → Observe → Think → Act, until the task is done.
You already run this loop. Every morning when you wake up and check your phone, see you have a 9am meeting, decide to leave early, check traffic, reroute — that's the same loop. The AI version just runs it faster, with more tools, and without needing coffee.

How tool calling actually works

Here's the part that surprised me when I learned it: the LLM doesn't need to be programmed to use a tool. It figures it out from the description.
You give the model a list of available tools — what each one does, what parameters it takes. The model reads that list the same way you'd read a manual. When it decides a tool is appropriate, it outputs a structured call instead of text:
```json
{
  "tool": "weather_api",
  "input": {
    "city": "Mumbai",
    "country": "India"
  }
}
```
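For context, here's roughly what the tool list the model reads might look like. The exact field names vary by provider (this is an illustrative schema, not any specific vendor's API):

```python
# Hypothetical tool definitions passed to the model alongside the prompt.
# The model chooses a tool purely by reading these descriptions -- there is
# no per-tool integration logic on the model side.
available_tools = [
    {
        "name": "weather_api",
        "description": "Get the current weather for a city.",
        "parameters": {
            "city": {"type": "string", "description": "City name"},
            "country": {"type": "string", "description": "Country name"},
        },
    },
    {
        "name": "read_file",
        "description": "Read a file from the project and return its contents.",
        "parameters": {
            "path": {"type": "string", "description": "Path relative to project root"},
        },
    },
]
```

The quality of those `description` strings matters a lot: they are the only manual the model gets.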
Your system executes that call, gets the result, and feeds it back into the model's context. The model reads the result, decides what to do next, and either calls another tool or generates a final response.
The agent isn't magic. It's an LLM in a loop with access to functions it can call.
```python
# Simplified agent loop
def run_agent(task, llm, available_tools):
    current_context = [task]

    while True:
        action = llm.decide(current_context, available_tools)

        if action.type == "tool_call":
            result = execute_tool(action.tool, action.input)
            current_context.append(result)  # feed result back in
        elif action.type == "final_response":
            return action.content
```

That's the entire architecture. Everything else — Claude Code, Cursor, Devin — is a variation of this loop with better tooling around it.

MCP: the piece that makes it composable

Before late 2024, connecting an AI agent to a new tool meant writing custom integration code every time. Want your agent to read Google Drive? Custom code. Search Slack? Different custom code. Query your database? Even more custom code.
This is the equivalent of a world where every phone had its own charger. You'd need a bag full of cables for three devices.
Anthropic published the Model Context Protocol (MCP) to solve exactly this. It's an open standard that defines how any AI agent talks to any tool or data source. You write an MCP server once for a tool, and every MCP-compatible agent can use it.
```
Your AI Agent
 │  (MCP Protocol — one universal standard)
 ├── MCP Server: File System
 ├── MCP Server: Web Search
 ├── MCP Server: GitHub
 ├── MCP Server: Slack
 ├── MCP Server: Your Database
 └── MCP Server: Whatever You Build
```
This is why Claude can search the web, read your files, create documents, and query databases through the same interface. Each capability is an MCP server. Build a new server for any tool, and it instantly works with every agent that speaks MCP.
The protocol is open source. The community is already building servers for everything. If you want to give your agent access to a new tool, you're often one npm install away from an existing MCP server.
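To make the composability concrete, here's a plain-Python sketch of the idea (not the real MCP SDK, and the server and tool names are made up): every server exposes its tools through the same interface, so the agent-side dispatch code never changes when you plug in a new one.

```python
# Simplified sketch of MCP-style composability. Every "server" answers the
# same two questions -- what tools do you have, and run this one -- so the
# agent code below works for any number of servers.
class FileSystemServer:
    def list_tools(self):
        return ["read_file", "write_file"]

    def call_tool(self, name, args):
        if name == "read_file":
            return f"<contents of {args['path']}>"
        raise ValueError(f"unknown tool: {name}")


class WebSearchServer:
    def list_tools(self):
        return ["search"]

    def call_tool(self, name, args):
        return f"<results for {args['query']}>"


# The agent sees one flat registry, regardless of how many servers exist.
servers = [FileSystemServer(), WebSearchServer()]
registry = {tool: server for server in servers for tool in server.list_tools()}

result = registry["search"].call_tool("search", {"query": "MCP spec"})
```

Adding a new capability means writing one more server class; nothing in the registry or dispatch logic changes. That's the whole pitch of the protocol, minus the wire format.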

Walking through what actually happened (in detail)
Let me reconstruct that eleven-word moment from the beginning. I typed:

"Add a login page to this React app with email/password auth."

Here's what the agent actually did:

Loop 1 — Orient:
Read package.json. Noted: React 18, react-router-dom v6, CSS modules, TypeScript. Read App.jsx to understand the existing routing structure. Read a few component files to absorb the coding patterns.

Loop 2 — Plan and build:
Created Login.jsx — form with email/password fields, useState for form state and error handling, loading state during async auth, error message display. Matched the CSS module pattern it had seen in other components.

Loop 3 — Add styles:
Created Login.module.css using the same color variables, spacing, and responsive patterns from the existing design system. Not generic CSS — the specific design system it had reverse-engineered from the codebase.

Loop 4 — Update routing:
Read App.jsx again, added the login route and an auth guard for protected routes.

Loop 5 — Create auth utilities:
Wrote auth.js with validateEmail(), validatePassword(), and a login() function with proper error handling for network failures.

Loop 6 — Run tests:
```
FAIL src/pages/Login.test.jsx
  ✗ should validate email format
    Expected: 'Please enter a valid email'
    Received: 'Invalid email address'
```

Loop 7 — Debug:
Read the test file. Understood that the tests were written before the component. Changed the error message to match the expected string.

Loop 8 — Verify:
Tests pass. Build succeeds. Done.
Eight loops. One sentence from me.

Why this isn't just "better autocomplete"

Autocomplete predicts the next character or line. It has no memory of what just ran in the terminal. It can't decide to run the tests, read the failure, understand it, and fix the source.
Agents operate over time and state. They maintain a growing context of what has happened — what files exist, what commands returned, what errors appeared — and they use that context to make decisions across many steps.
The difference is the loop. Without the loop, you have a smart text predictor. With the loop plus tools, you have something that can actually execute a workflow.

What this means for how you build

A few things I've changed in how I think about AI after internalizing the agent architecture:
Stop thinking in single prompts. If you're writing one massive prompt trying to get the model to do everything at once, you're fighting the architecture. Agents are designed to work iteratively. Let them.
Tools are leverage. The quality of an agent is largely determined by what tools it has access to and how well those tools are described. A mediocre model with great tools often beats a great model with no tools.
Context is everything. The agent is only as good as what's in its context window at decision time. This is why products like Cursor are so powerful — they're doing aggressive, intelligent context injection before every LLM call.
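A toy version of that context injection, just to show the shape of it (the ranking heuristic here is a crude keyword overlap I made up; real tools use embeddings and much smarter retrieval):

```python
# Hypothetical sketch of pre-call context injection: rank candidate files
# by keyword overlap with the task, then pack the most relevant ones into
# a fixed character budget before the LLM call.
def build_context(task, files, budget_chars=2000):
    keywords = set(task.lower().split())

    def relevance(item):
        _, text = item
        return len(keywords & set(text.lower().split()))

    ranked = sorted(files.items(), key=relevance, reverse=True)

    chunks, used = [], 0
    for path, text in ranked:
        if used + len(text) > budget_chars:
            continue  # skip files that would blow the budget
        chunks.append(f"# {path}\n{text}")
        used += len(text)
    return "\n\n".join(chunks)


files = {
    "src/pages/Login.jsx": "login form with email and password fields",
    "src/pages/Chart.jsx": "d3 chart render axes",
}
ctx = build_context("add a login page with email password auth", files)
```

The point isn't the heuristic; it's that something deliberate happens between your prompt and the model, and that something largely determines output quality.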
MCP is worth learning now. The ecosystem is moving fast. If you start building MCP servers for tools your workflow depends on, you're building once and benefiting from every future agent that speaks the protocol.

The honest picture

Agents also fail in interesting ways. They can get stuck in loops, take wrong turns and double down on them, use the wrong tool for a job, or run up large costs on simple tasks. The observe-think-act loop is powerful, but it's only as reliable as the model's judgment at each decision point.
The field is still figuring out how to make agents reliably safe for high-stakes actions — things where a wrong file write or a wrong API call can't be undone. Humans in the loop, permission systems, and careful tool design are the current answers.
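A permission system can be as simple as a gate in front of the tool executor. This is a minimal sketch under my own assumptions (the tool names and the approval callback are illustrative, not from any real agent framework):

```python
# Sketch of a human-in-the-loop permission gate: tools whose effects can't
# be undone require explicit approval before the agent may run them.
IRREVERSIBLE = {"delete_file", "send_email", "charge_card"}


def run_tool(tool, args):
    # Stand-in for the real tool executor.
    return f"executed {tool}"


def execute_with_guard(tool, args, approve):
    """approve: callable that asks a human and returns True or False."""
    if tool in IRREVERSIBLE and not approve(tool, args):
        return {"status": "blocked", "reason": "human denied approval"}
    return {"status": "ok", "result": run_tool(tool, args)}
```

Reversible tools flow through untouched; irreversible ones pause the loop until a human says yes. Claude Code's permission prompts follow roughly this pattern.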
But for tasks that are well-defined, reversible, and code-related? We're already there.

The bigger picture

LLM → RAG → Agents. That's the progression.
The LLM is the brain: it reasons, generates, understands.
RAG is the memory: it retrieves relevant context on demand.
Agents are the hands: they act, iterate, and complete tasks autonomously.
That morning I spent watching an agent navigate my codebase, I realized I was watching something that would have seemed like science fiction to me five years ago. Not because any single piece is magic — each piece is just software — but because of what they become when you connect them in a loop.
A brain, with memory, that can act.
That's what's being built right now.

What's your experience with agents so far? Are you using Claude Code, Cursor, or something else? Drop a comment; I'm collecting data on what's actually working in production.
