<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tosin okunniga</title>
    <description>The latest articles on DEV Community by Tosin okunniga (@teegoldz).</description>
    <link>https://dev.to/teegoldz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3901218%2F0b2cfc6d-dd6b-4de1-8014-35f8e603e096.png</url>
      <title>DEV Community: Tosin okunniga</title>
      <link>https://dev.to/teegoldz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/teegoldz"/>
    <language>en</language>
    <item>
      <title>Building My Own AI Coding Agent From Scratch: A Learning Journey</title>
      <dc:creator>Tosin okunniga</dc:creator>
      <pubDate>Mon, 27 Apr 2026 21:45:43 +0000</pubDate>
      <link>https://dev.to/teegoldz/building-my-own-ai-coding-agent-from-scratch-a-learning-journey-18ol</link>
      <guid>https://dev.to/teegoldz/building-my-own-ai-coding-agent-from-scratch-a-learning-journey-18ol</guid>
      <description>&lt;p&gt;&lt;em&gt;A €5 VPS. A Telegram bot. A Python script that kept growing. This is the story of what I learned by choosing to build rather than borrow, warts, dead ends, and all.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;The Decision&lt;/h2&gt;

&lt;p&gt;In early 2025, AI coding tools were everywhere. Copilot in your editor, Cursor rewriting your files, Devin promising to replace entire sprints of work. The open-source side was just as active. OpenClaw had become the fastest-downloaded agent on GitHub, and anyone serious about local inference was buying a Mac Mini to run models at home.&lt;br&gt;
I was tempted by both options. But I kept asking myself an uncomfortable question: if I just install one of these, do I actually understand what it's doing? The model routing, the conversation management, the diff review loop, the fallback chains when an API goes down at 2am. Do I understand any of that?&lt;/p&gt;

&lt;p&gt;I didn't. And rather than shortcutting past that gap, I decided to sit in it for a while. Not because building from scratch is always the right call (it usually isn't), but because I wanted to learn by doing, even if what I built was rough around the edges.&lt;/p&gt;

&lt;p&gt;So I set some constraints. Cheap server. Telegram as the interface, since it's always open on my phone. Python, because I know it well. And a rule: no pretending the hard parts don't exist. I spun up a Hetzner VPS for a few euros a month, wired it to Telegram's bot API, and started writing.&lt;/p&gt;

&lt;p&gt;What followed was not a smooth arc from prototype to polished product. It was a series of painful discoveries, each one teaching me something I wouldn't have encountered by downloading somebody else's solution. The agent is still rough in places. It's not production-ready, and I'm not claiming it is. But it works, it's mine, and I understand every part of why.&lt;/p&gt;


&lt;h2&gt;It Started Embarrassingly Simple&lt;/h2&gt;

&lt;p&gt;The first version was about 80 lines of Python. A Telegram message arrived, got forwarded to Anthropic's API, and the reply came back. No memory, no git integration, no diff review. Just a slightly expensive chat window with a bot icon.&lt;/p&gt;

&lt;p&gt;The first time I pointed it at a real repo, it was a complete mess. The bot misunderstood the task, edited the wrong file, and committed something that broke the build. I spent more time cleaning up after it than the fix would have taken manually. Not exactly the "AI takes over your workflow" moment I had imagined.&lt;/p&gt;

&lt;p&gt;That weekend prototype became the skeleton of everything that followed. What it also exposed, almost immediately, was a question I hadn't thought through: how exactly does an LLM &lt;em&gt;edit&lt;/em&gt; a file?&lt;/p&gt;


&lt;h2&gt;The Bash Nightmare&lt;/h2&gt;

&lt;p&gt;The obvious answer is shell commands. Ask the model what to change, have it output a &lt;code&gt;sed&lt;/code&gt; invocation, run it through &lt;code&gt;subprocess.run()&lt;/code&gt;. I tried this. It was a disaster.&lt;/p&gt;
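
&lt;p&gt;This is roughly the shape of that first executor; a reconstruction with illustrative names, not the actual code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

def run_model_command(command: str) -&amp;gt; str:
    """Run a shell string the LLM produced. The naive version that
    caused most of the problems described below."""
    result = subprocess.run(
        command,
        shell=True,           # the model emits a full shell string
        capture_output=True,
        text=True,
        timeout=30,
    )
    # Exit code 0 tells you almost nothing here: sed exits 0 even when
    # its pattern matched nothing and the file is unchanged.
    if result.returncode != 0:
        return f"error: {result.stderr.strip()}"
    return result.stdout or "(no output)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;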

&lt;p&gt;The escaping problem came first. A &lt;code&gt;sed&lt;/code&gt; command that works in your terminal will randomly fail when an LLM constructs it and Python passes it to a subprocess. Special characters in the replacement string (backslashes, ampersands, quotes, newlines) follow escaping rules that vary between &lt;code&gt;sed&lt;/code&gt; on Linux and macOS, between single-quoted and double-quoted contexts, between inline and &lt;code&gt;-i&lt;/code&gt; mode. The model would produce something that looked correct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s/def old_function/def new_function/g'&lt;/span&gt; src/api.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And it would either silently do nothing, silently corrupt the file, or throw an error whose message had no obvious connection to the actual problem.&lt;/p&gt;

&lt;p&gt;The second issue was semantic precision. &lt;code&gt;sed&lt;/code&gt; replaces text patterns, not code constructs. Tell it to update a function signature and it might replace every occurrence of that string across the entire file, including comments, docstrings, and a different function with a similar name. Or it would match nothing because the file used two spaces where the model expected one.&lt;/p&gt;

&lt;p&gt;I tried &lt;code&gt;awk&lt;/code&gt;. Same category of problems, steeper learning curve to debug.&lt;/p&gt;

&lt;p&gt;I tried having the model output a short Python script that would open the file and rewrite the relevant lines. This worked slightly better, right up until the editing script itself had a bug. At that point the error message was about the meta-code, not the original task. &lt;/p&gt;

&lt;p&gt;The deepest problem with all of these approaches is that bash is stateless in exactly the wrong way. A command either exits 0 or it doesn't. There's no introspection, no "here's what I was trying to do when this failed." When the agent ran &lt;code&gt;sed&lt;/code&gt; on the wrong file, or ran it twice, or ran it on a path that didn't exist yet, the only signal was silence or corruption.&lt;/p&gt;

&lt;p&gt;The breaking point was a session where the agent introduced a syntax error into a working file, failed to fix the bug I had asked it to fix, and then replied "Done!" with apparent confidence. I needed something better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aider&lt;/strong&gt; changed the dynamic entirely. Rather than generating shell commands that treat source files as raw text, Aider understands code as a structured thing. It reads the file, locates the right construct, applies the change, and writes back. When it fails, it fails with context rather than a corrupted file and a zero exit code. The tradeoff is that Aider is a subprocess with its own opinions, its own timeouts, and its own quirks. But its failure modes are legible, and legible failures are fixable. Silent failures are not.&lt;/p&gt;
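
&lt;p&gt;The wrapper around it stays small: a one-shot subprocess call per task. A hedged sketch (Aider's flag names can vary between versions, so treat these as indicative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

def run_aider(repo_path: str, instruction: str, files: list[str]) -&amp;gt; str:
    """One-shot Aider run: apply `instruction` to `files`, then exit."""
    cmd = ["aider", "--yes", "--message", instruction, *files]
    result = subprocess.run(
        cmd,
        cwd=repo_path,        # Aider operates on the repo's working tree
        capture_output=True,
        text=True,
        timeout=600,          # it has its own timeouts; add an outer one
    )
    # Unlike raw sed, failure comes back as readable text describing
    # what Aider was attempting when it gave up.
    return (result.stdout + result.stderr).strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;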

&lt;p&gt;I learned something from this that applied everywhere else in the project: &lt;em&gt;the right tool is the one that fails in a way you can understand.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;Model Routing Is a Product Problem&lt;/h2&gt;

&lt;p&gt;With Aider handling file edits, the next question was which LLM to put behind it and what to do when that LLM has a bad day.&lt;/p&gt;

&lt;p&gt;I started with Anthropic exclusively, but API credits run out and models go down. I added Groq for the free tier. Then I found OpenRouter: a single API endpoint that proxies dozens of models. One API key, one client class, and you can switch between DeepSeek, Qwen, Mistral, and a dozen others by changing a string. For a small project running on a budget VPS, that flexibility matters.&lt;/p&gt;

&lt;p&gt;But "switch to OpenRouter" undersells the actual problem. In a production agent loop, models fail constantly and in different ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5xx errors from overloaded providers&lt;/li&gt;
&lt;li&gt;Rate limits that return an error after a 30-second wait&lt;/li&gt;
&lt;li&gt;Models that return 200 OK but produce malformed output&lt;/li&gt;
&lt;li&gt;Models that accept tool-calling syntax but don't actually execute tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A naive implementation treats any of these as fatal. A more careful one handles each differently. I built a layered fallback:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Primary model
  → retry same model (up to 2x with exponential backoff on 5xx)
    → fallback model (e.g., DeepSeek → Claude Haiku)
      → Groq circuit breaker (back off for the remainder of the rate-limit window)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
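
&lt;p&gt;In code, the chain is a loop over candidates rather than anything clever. A simplified sketch, where the &lt;code&gt;call_model&lt;/code&gt; callable and the two exception types stand in for the real client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

FALLBACK_CHAIN = ["deepseek", "haiku"]    # primary model first

class ServerError(Exception):
    """Provider returned a 5xx."""

class RateLimited(Exception):
    """Provider returned a 429."""

def complete_with_fallback(prompt: str, call_model) -&amp;gt; str:
    for model in FALLBACK_CHAIN:
        for attempt in range(3):          # first try plus two retries
            try:
                return call_model(model, prompt)
            except ServerError:           # overloaded: back off, retry
                time.sleep(2 ** attempt)
            except RateLimited:           # quota hit: next model, now
                break
    raise RuntimeError("all models exhausted")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;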



&lt;p&gt;One detail that bit me early: not every model supports tool-calling in the structured sense that agentic loops require. Sending tool-use JSON to a model that doesn't understand it doesn't produce an error. It produces a garbled reply that silently breaks the task. The model registry tracks this explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MODELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openrouter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek/deepseek-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;supports_tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openrouter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen-2.5-coder-32b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;supports_tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-5-20251001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;supports_tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a task requires tool-calling and the selected model doesn't support it, the agent steps up to one that does without the user knowing or caring.&lt;/p&gt;
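
&lt;p&gt;A minimal sketch of that check, built on the registry above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def pick_model(preferred: str, needs_tools: bool) -&amp;gt; str:
    """Return `preferred`, unless the task needs tool-calling and the
    preferred model can't do it; then upgrade to one that can."""
    if needs_tools and not MODELS[preferred]["supports_tools"]:
        for name, spec in MODELS.items():
            if spec["supports_tools"]:
                return name
    return preferred
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;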

&lt;p&gt;The Groq circuit breaker deserves its own mention because it solved a genuinely painful problem. Groq's free tier is generous but has strict daily token limits. Early on, every background process (message classification, conversation compaction, post-task reflection) would independently call Groq until the daily quota was gone. By mid-afternoon, the agent was silent. A shared circuit breaker fixed this. When any one caller hits a rate limit, a global flag stops every other caller from trying during the same backoff window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;groq_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;_GROQ_BACKOFF_UNTIL&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;groq_mark_rate_limited&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry_after_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_GROQ_BACKOFF_UNTIL&lt;/span&gt;
    &lt;span class="n"&gt;_GROQ_BACKOFF_UNTIL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;retry_after_seconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One failure, system-wide protection. Simple, and it works.&lt;/p&gt;
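
&lt;p&gt;A sketch of how a call site uses the breaker; &lt;code&gt;call_groq&lt;/code&gt;, &lt;code&gt;RateLimitError&lt;/code&gt;, and &lt;code&gt;CLASSIFY_PROMPT&lt;/code&gt; are placeholders rather than the real client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def classify_message(text: str):
    if not groq_available():      # someone already tripped the breaker
        return None               # caller falls through to the next tier
    try:
        return call_groq(CLASSIFY_PROMPT, text)
    except RateLimitError as exc:
        # Trip the breaker for every caller, not just this one.
        groq_mark_rate_limited(getattr(exc, "retry_after", None) or 3600.0)
        return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;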




&lt;h2&gt;The Dispatcher: Knowing When Not to Code&lt;/h2&gt;

&lt;p&gt;Once model routing was stable, a different problem surfaced. The agent couldn't tell the difference between a coding instruction and a casual message.&lt;/p&gt;

&lt;p&gt;Typing "hello" routed the bot to Aider, which would reply asking for files to add to its context. Typing "what did you change?" started a new job. Everything was a task.&lt;/p&gt;

&lt;p&gt;The fix was a lightweight dispatcher: a fast LLM call before any routing decision, with a strict two-output contract.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If the user is asking you to do coding work → output the single word: TASK
If the user is chatting, asking questions, or confirming → output a helpful reply
NEVER output both. "yes", "ok", "sure" are ALWAYS conversational.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The contract had to be that explicit. Early versions would classify a message correctly but then write a summary alongside it, something like "I'll get right on that. TASK", and the presence of the word TASK anywhere in the response would trigger a job with a nonsensical description.&lt;/p&gt;

&lt;p&gt;The dispatch chain runs Anthropic Haiku first (fast, cheap, good at classification), then Groq Llama as a backup, then a regex heuristic for obvious phrases (&lt;code&gt;hi&lt;/code&gt;, &lt;code&gt;thanks&lt;/code&gt;, &lt;code&gt;what did you do&lt;/code&gt;, &lt;code&gt;explain&lt;/code&gt;), and only if all of that fails does it default to treating the message as a task. Each fallback is cheaper and dumber than the one before it, but the chain as a whole handles the common cases reliably.&lt;/p&gt;
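
&lt;p&gt;Assembled, a dispatcher along these lines looks roughly like this; the &lt;code&gt;classify_with_*&lt;/code&gt; coroutines are placeholders for the Haiku and Groq clients:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

_CHAT_RE = re.compile(r"^(hi|hello|thanks|ok|yes|sure|what did you|explain)\b", re.I)

async def dispatch(text: str) -&amp;gt; str:
    """Return the single word 'TASK', or a conversational reply."""
    for classify in (classify_with_haiku, classify_with_groq):
        try:
            reply = (await classify(text)).strip()
        except Exception:
            continue                  # this tier is down; try a cheaper one
        # Enforce the contract: TASK must be the entire response,
        # not a word buried inside a summary.
        return "TASK" if reply == "TASK" else reply
    if _CHAT_RE.match(text.strip()):
        return "How can I help?"      # regex tier: obviously conversational
    return "TASK"                     # last resort: treat it as work
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;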




&lt;h2&gt;Conversation Management: The Problem Nobody Blogs About&lt;/h2&gt;

&lt;p&gt;Routing messages correctly solved one problem and revealed a deeper one. Conversations across different jobs were contaminating each other.&lt;/p&gt;

&lt;p&gt;A session from the previous day, where Aider had asked to add Flask files to its context, would leak into a completely different job the next morning. The agent would reference those files, ask for things unrelated to the current task, behave as though it didn't know which job it was working on. The bug was invisible until the output was obviously wrong.&lt;/p&gt;

&lt;p&gt;Three separate issues, three separate fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session bleed.&lt;/strong&gt; The conversation history wasn't cleared between jobs. Whatever noise Aider produced during the last session was still in context for the next one. Fix: call &lt;code&gt;convo_clear(user_id)&lt;/code&gt; when any new job starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context explosion.&lt;/strong&gt; Conversation history grows with every turn. After 40 turns you're feeding thousands of tokens of stale context to every new request, paying for tokens that actively make the output worse. Fix: compact old turns. When history exceeds 30 messages, the oldest half gets summarised into a single bullet-point block using a lightweight Groq call. The agent sees the essence of past context, not a wall of raw text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compact_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;hist&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_conversations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;cutoff&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;_COMPACT_AFTER&lt;/span&gt;
    &lt;span class="n"&gt;to_squash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;keep&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_call_summary_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_squash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;_conversations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Earlier conversation summary]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;keep&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;History contamination.&lt;/strong&gt; Aider's raw output is verbose and full of tool-call artefacts. Storing it verbatim means future prompts inherit noise instead of signal. Fix: when a job completes, store only the first line of its result, a one-sentence summary of what was done.&lt;/p&gt;
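
&lt;p&gt;The first and third fixes are each a couple of lines. An illustrative sketch; &lt;code&gt;convo_append&lt;/code&gt; is a hypothetical counterpart to the &lt;code&gt;convo_clear&lt;/code&gt; above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def on_job_start(user_id: int) -&amp;gt; None:
    convo_clear(user_id)    # session bleed: no state survives a new job

def on_job_complete(user_id: int, raw_result: str) -&amp;gt; None:
    # History contamination: keep the one-line summary, drop the
    # verbose tool-call output Aider printed along the way.
    lines = raw_result.strip().splitlines()
    convo_append(user_id, role="assistant", content=lines[0] if lines else "")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;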

&lt;p&gt;All three together made the agent stop confusing its own past with whatever you're asking it to do right now.&lt;/p&gt;




&lt;h2&gt;Git Operations Needed Their Own Layer&lt;/h2&gt;

&lt;p&gt;With coding tasks working reliably, git operations were the next thing to break. Cherry-picks, branch comparisons, selective file checkouts: all of them ran into the same wall.&lt;/p&gt;

&lt;p&gt;The executor deliberately blocks shell composition operators: &lt;code&gt;$()&lt;/code&gt;, &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt;, &lt;code&gt;|&lt;/code&gt;, &lt;code&gt;;&lt;/code&gt;. Running arbitrary shell pipelines on a remote server is dangerous, and the controlled environment is worth the restriction. The problem is that every LLM's first instinct for "iterate over a list of files" looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;file &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;git diff &lt;span class="nt"&gt;--name-only&lt;/span&gt; origin/branch&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;git checkout origin/branch &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model would generate this, the executor would reject it, and the task would fail with an unhelpful error message.&lt;/p&gt;
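
&lt;p&gt;The guard doing the rejecting is blunt by design. A sketch of the idea; a real check would also have to care about quoting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;BLOCKED_OPERATORS = ("$(", "`", "&amp;amp;&amp;amp;", "||", "|", ";", "&amp;gt;", "&amp;lt;")

def check_command(cmd: str) -&amp;gt; None:
    """Reject any command that tries to compose shell operations."""
    for op in BLOCKED_OPERATORS:
        if op in cmd:
            raise ValueError(
                f"shell operator {op!r} is not permitted; "
                "run each command as its own step"
            )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;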

&lt;p&gt;Three changes fixed it, and all three were needed.&lt;/p&gt;

&lt;p&gt;First, a &lt;code&gt;_is_git_task()&lt;/code&gt; function detects when the approved plan is primarily git operations and routes away from Aider entirely. Aider doesn't handle &lt;code&gt;git checkout&lt;/code&gt;. It tries to open a chat session and ask for files to add.&lt;/p&gt;
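
&lt;p&gt;One plausible shape for that check, as a sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def _is_git_task(plan_steps: list[str]) -&amp;gt; bool:
    """True when the approved plan is mostly git commands, so the job
    should bypass Aider and go straight to the command executor."""
    if not plan_steps:
        return False
    git_steps = sum(1 for s in plan_steps if s.lstrip().startswith("git "))
    return git_steps / len(plan_steps) &amp;gt;= 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;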

&lt;p&gt;Second, a git-specific skill block gets injected into the system prompt with an explicit CRITICAL section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CRITICAL. No shell operators permitted.
  WRONG: git checkout $(git diff --name-only)
  WRONG: for file in $(...); do ...
  RIGHT: run "git diff --name-only" first. Read the output. Then run
         "git checkout origin/&amp;lt;branch&amp;gt; -- &amp;lt;file&amp;gt;" once per file as a separate step.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Third, &lt;code&gt;_inject_plan_steps()&lt;/code&gt; takes the approved plan steps and embeds them directly into the task message as numbered instructions. The agent doesn't have to infer what to do from a vague description. It gets explicit, ordered commands.&lt;/p&gt;
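
&lt;p&gt;That injection is plain string formatting. A sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def _inject_plan_steps(task_message: str, plan_steps: list[str]) -&amp;gt; str:
    """Embed the approved plan as explicit numbered instructions so the
    model executes the steps instead of re-deriving them."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(plan_steps, 1))
    return f"{task_message}\n\nExecute these steps in order:\n{numbered}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;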

&lt;p&gt;Any one or two of those mechanisms alone wasn't enough. All three working together is what made git tasks actually reliable.&lt;/p&gt;




&lt;h2&gt;The Review Loop: Staying in Control&lt;/h2&gt;

&lt;p&gt;All of this infrastructure exists to support one thing: a diff review loop that keeps you in control before anything gets committed.&lt;/p&gt;

&lt;p&gt;When the agent finishes a job, rather than committing immediately, it presents the changes with a structured review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A summary of what changed and why&lt;/li&gt;
&lt;li&gt;Any risks or unintended side effects it noticed&lt;/li&gt;
&lt;li&gt;A conventional commit message with type, scope, subject, body, and justification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From there you can approve, revert, or keep talking. Ask "why did you change this line?" and the agent answers using the actual diff as context. Give a follow-up instruction and it applies further changes on the same branch before showing you another diff. Only an explicit &lt;code&gt;approve &amp;lt;job_id&amp;gt;&lt;/code&gt; triggers the commit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You:  fix the auth middleware
Bot:  📋 Plan #42 [5 steps] - reply "approve 42" to proceed
You:  approve 42
Bot:  Done. Here's what changed:
        Modified: src/auth/middleware.py, tests/test_auth.py
        feat(auth): replace session token storage to meet compliance requirements
        Risk: existing sessions will be invalidated on deploy
        Reply "approve 42" to commit, "revert 42" to cancel, or describe further changes.
You:  why did you touch line 47 in middleware.py?
Bot:  Line 47 contained the old token serialisation logic. The new approach signs
      the token with HMAC-SHA256 before storing it, which is what the compliance
      requirement specifies. I moved the signing step here so it applies to all
      token creation paths, not just the login endpoint.
You:  approve 42
Bot:  Committed. Reply "push 42" to push to origin.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters because the agent does make mistakes. It changes more than it should, or gets the right thing but in the wrong place. Having a structured conversation about the diff before the commit is where those mistakes get caught, not after they've landed in the repo.&lt;/p&gt;




&lt;h2&gt;Memory: Not Repeating the Same Mistakes&lt;/h2&gt;

&lt;p&gt;Over time I noticed a frustrating pattern. The agent would repeat errors that had already been fixed. It would use &lt;code&gt;pytest&lt;/code&gt; on a project that required &lt;code&gt;pytest -x --no-header&lt;/code&gt;. It would try to write a file without reading it first. Things I'd already debugged would reappear in the next session as if nothing had happened.&lt;/p&gt;

&lt;p&gt;The fix was persistent memory, and for this project that meant SQLite.&lt;/p&gt;

&lt;p&gt;The choice is almost boring to explain. SQLite is already on the server, requires no setup, never goes down, and costs nothing. I briefly looked at alternatives. Postgres is overkill for a single-user agent on a €5 VPS. A hosted database like Supabase adds a dependency, a credential to manage, and a recurring cost for data that has no business being remote. Redis would give me sub-millisecond reads on data I query a few times per task. That's solving a problem that doesn't exist.&lt;/p&gt;

&lt;p&gt;SQLite is just a file sitting next to the agent code. Backups are &lt;code&gt;cp&lt;/code&gt;. Inspection is &lt;code&gt;sqlite3 memory.db&lt;/code&gt;. That's exactly the right level of complexity here.&lt;/p&gt;

&lt;p&gt;Two tables, two jobs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;lessons&lt;/code&gt;&lt;/strong&gt; accumulates knowledge over time. Repo-specific quirks, global patterns, things that went wrong and how they were fixed. A &lt;code&gt;hit_count&lt;/code&gt; column tracks how often each lesson gets injected into a prompt. Useful lessons naturally rise to the top. Stale ones fade without any manual curation. At the start of each job, the most relevant lessons go straight into the system prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;history&lt;/code&gt;&lt;/strong&gt; is a rolling log of the last 100 tasks: what was asked, what the outcome was, which model handled it. This gives the agent recent context without relying on in-memory state that disappears when the process restarts.&lt;/p&gt;
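
&lt;p&gt;The schema is as small as that description suggests. A sketch; the table and column names mentioned above are real, the rest is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect("memory.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS lessons (
    id        INTEGER PRIMARY KEY,
    scope     TEXT NOT NULL,            -- 'global' or a repo name
    lesson    TEXT NOT NULL,
    hit_count INTEGER DEFAULT 0         -- bumped on every injection
);
CREATE TABLE IF NOT EXISTS history (
    id        INTEGER PRIMARY KEY,
    asked     TEXT NOT NULL,            -- what the user requested
    outcome   TEXT NOT NULL,            -- what actually happened
    model     TEXT NOT NULL,            -- which model handled it
    created   TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

# Injecting lessons most-used first is what lets useful ones rise
# and stale ones fade without manual curation.
repo = "my-agents"    # illustrative
top_lessons = conn.execute(
    "SELECT lesson FROM lessons WHERE scope IN ('global', ?) "
    "ORDER BY hit_count DESC LIMIT 10", (repo,),
).fetchall()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;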

&lt;p&gt;After every completed task, a Groq reflection call pulls lessons from the outcome:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;REFLECT_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Extract concise, actionable lessons from this task outcome. Focus on:
- Mistakes made and how they were corrected
- Repo-specific conventions: test commands, build tools, file structure
- Patterns that worked and should be repeated
Return ONLY valid JSON: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;global_lessons&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [...], &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo_lessons&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [...]}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
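
&lt;p&gt;The plumbing around that prompt, sketched; &lt;code&gt;call_groq&lt;/code&gt; and &lt;code&gt;save_lesson&lt;/code&gt; are placeholders, while &lt;code&gt;groq_available&lt;/code&gt; is the breaker check from earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

async def reflect_on_task(task_summary: str, repo: str) -&amp;gt; None:
    """Post-task reflection: ask a cheap model for lessons, persist them."""
    if not groq_available():      # respect the shared circuit breaker
        return
    raw = await call_groq(REFLECT_PROMPT, task_summary)
    try:
        lessons = json.loads(raw)
    except json.JSONDecodeError:
        return                    # a bad reflection isn't worth a retry
    for text in lessons.get("global_lessons", []):
        save_lesson(scope="global", lesson=text)
    for text in lessons.get("repo_lessons", []):
        save_lesson(scope=repo, lesson=text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;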



&lt;p&gt;The agent also reflects on conversations, not just task outcomes. After each exchange, a separate pass extracts facts about how the user likes to work: communication style, technical level, preferred tools. These shape how the dispatcher responds and how the agent communicates over time.&lt;/p&gt;




&lt;h2&gt;Deploying Without Leaving Telegram&lt;/h2&gt;

&lt;p&gt;Once the agent was useful enough that I was pushing changes to it regularly, a small friction point started adding up. Every push meant SSH-ing into the server, running &lt;code&gt;git pull&lt;/code&gt;, and restarting the process. It sounds minor. It stopped feeling minor around the sixth time in one evening.&lt;/p&gt;

&lt;p&gt;The obvious fix was GitHub Actions: a workflow that SSHes into the server on every push and restarts the service. But that means storing server credentials in GitHub Secrets, burning Actions minutes, and adding a whole CI layer to a personal tool that runs on €5 of compute. Not worth it.&lt;/p&gt;

&lt;p&gt;Instead I added a &lt;code&gt;/update&lt;/code&gt; command directly to the bot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;You: /update
Bot: Pulling latest code...
     git pull: 3 files changed, 47 insertions(+), 12 deletions(-)
     Restarting agent...
     Agent restarted. Now running: bf7e8f4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood it runs &lt;code&gt;git pull&lt;/code&gt; on the server repo and then &lt;code&gt;systemctl restart agent&lt;/code&gt;. About 15 lines of Python. No pipeline, no third-party credentials, no billing.&lt;/p&gt;
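
&lt;p&gt;Those 15 lines, give or take, as a sketch in python-telegram-bot's async handler style; the repo path and privilege setup are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

async def cmd_update(update, context):
    """Telegram handler for /update: pull, report, restart."""
    await update.message.reply_text("Pulling latest code...")
    pull = subprocess.run(
        ["git", "-C", "/opt/agent", "pull"],    # path is illustrative
        capture_output=True, text=True,
    )
    await update.message.reply_text(pull.stdout.strip() or pull.stderr.strip())
    # Reply first: the restart below kills this very process.
    await update.message.reply_text("Restarting agent...")
    subprocess.Popen(["sudo", "systemctl", "restart", "agent"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;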

&lt;p&gt;The bot is already the authentication boundary. Any &lt;code&gt;/update&lt;/code&gt; request has already passed through the same check as &lt;code&gt;approve&lt;/code&gt;, &lt;code&gt;revert&lt;/code&gt;, and everything else. There's no reason to build a parallel deployment channel when the one you've already got works perfectly.&lt;/p&gt;

&lt;p&gt;The workflow now: push to GitHub, type &lt;code&gt;/update&lt;/code&gt; in Telegram, done. The server pulls and restarts in a few seconds. No SSH, no CI, no cost beyond the VPS.&lt;/p&gt;




&lt;h2&gt;What Building It Taught Me&lt;/h2&gt;

&lt;p&gt;Looking back, choosing to build rather than download produced a kind of understanding I couldn't have gotten any other way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model routing is a product decision.&lt;/strong&gt; Picking the right model for each task (cheap and fast for classification, capable for code generation, free-tier for background reflection) is a real engineering problem with real cost implications. None of it is obvious until you've watched the loop break in three different ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conversation state is where agents actually fall apart.&lt;/strong&gt; Every article about agents talks about prompts. Almost none of them talk about what happens to conversation history after 40 turns, or what happens when two sessions bleed into each other. That's where reliability actually lives, and it's completely invisible when it's working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The review loop is the point, not the workaround.&lt;/strong&gt; Having a back-and-forth about the diff before committing is more useful than just letting the agent commit directly. It makes mistakes. The conversation catches them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cheap infrastructure goes further than you'd think.&lt;/strong&gt; A €5 VPS, free-tier Groq, SQLite, and OpenRouter pay-per-token costs less per month than a single Claude Pro subscription. The economics are surprisingly good once you stop paying for abstraction layers you don't need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can't shortcut the learning.&lt;/strong&gt; I now have a real working understanding of model fallback chains, conversation lifecycle management, tool-calling protocols, and diff review loops. Not from reading about them but from writing code that broke in those exact places and having to fix it. I'm still learning. But the gap between where I started and where I am now only exists because I chose to build.&lt;/p&gt;




&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;There's still a lot to figure out. The agent works well enough that I use it regularly, but I'm under no illusion that it's finished. If anything, building it has surfaced more questions than it's answered.&lt;/p&gt;

&lt;p&gt;Right now it needs to be told which repo a task belongs to. I'd like it to pick that up from context instead of always waiting to be told. Test integration is another gap: being able to run the test suite after a change and include the results in the diff review before asking for approval would make the whole loop a lot more trustworthy. I also want to explore webhook triggers so a GitHub push can kick off the agent directly without me needing to type anything in Telegram.&lt;/p&gt;

&lt;p&gt;Honestly, the list keeps growing the more I use it. Every time I run a task I notice something that could be better. That feels like a good sign.&lt;/p&gt;

&lt;p&gt;The code is messy in places. Some of the fallback logic is held together with pattern-matching and stubbornness. It's not production-ready and I'm not trying to pretend otherwise. But it's mine, I understand it, and that was the whole point of building it in the first place.&lt;/p&gt;




&lt;p&gt;Stack: Hetzner VPS (€4.51/month) · python-telegram-bot · Anthropic API · OpenRouter · Groq (free tier) · Aider-chat · SQLite&lt;br&gt;
Code: github.com/Teegold007/my-agents&lt;br&gt;
If you're building something similar or have questions about any of the decisions here, feel free to reach out.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>agentskills</category>
    </item>
  </channel>
</rss>
