<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pawel Jozefiak</title>
    <description>The latest articles on DEV Community by Pawel Jozefiak (@joozio).</description>
    <link>https://dev.to/joozio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3805838%2F26cb0821-19c4-4d0c-a0df-b0a8e75e3a0d.png</url>
      <title>DEV Community: Pawel Jozefiak</title>
      <link>https://dev.to/joozio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joozio"/>
    <language>en</language>
    <item>
      <title>How to Build Your First AI Agent (Basics). Full Package</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Thu, 16 Apr 2026 11:14:55 +0000</pubDate>
      <link>https://dev.to/joozio/how-to-build-your-first-ai-agent-basics-full-package-512k</link>
      <guid>https://dev.to/joozio/how-to-build-your-first-ai-agent-basics-full-package-512k</guid>
<description>&lt;h1&gt;How to Build Your First AI Agent (Basics)&lt;/h1&gt;

&lt;p&gt;Six months of mistakes, a real walk-through, and everything I wish someone had told me before I started.&lt;/p&gt;

&lt;p&gt;I've been building my own AI agent since October. Every mistake you can make on a first build, I've made. Some of them twice.&lt;/p&gt;

&lt;p&gt;A few days ago I asked my readers what I should write about for beginners. The answers lined up surprisingly cleanly. Almost everyone asked for the same thing in different words: the real stuff. What actually goes wrong. What to do on day one. How to start without feeling lost.&lt;/p&gt;

&lt;p&gt;So here it is. More structured than my usual posts, because this one is for people starting from zero. If you already have an agent running, most of this will still be useful, but the mental model is written for someone who's never done this before.&lt;/p&gt;

&lt;p&gt;One thing before we start. Mistakes aren't failure. For early adopters, they ARE the job. Everyone building in this space is hitting the same walls at the same time, because nobody has the map yet. You're not doing it wrong. You're doing it at all, which is the hard part.&lt;/p&gt;

&lt;h2&gt;1. What is an AI agent, really (and why it's different from automation)&lt;/h2&gt;

&lt;p&gt;My starting point wasn't AI. It was Zapier.&lt;/p&gt;

&lt;p&gt;I've been building classical automations for years. Zapier, n8n, make.com, custom scripts, connectors glued together with duct tape. When I started thinking about building my own agent back in October, my first instinct was to do exactly what I knew: chain tools together with a workflow builder and call it a day. I actually started that way.&lt;/p&gt;

&lt;p&gt;Honestly, for a lot of people reading this, that's still a perfectly reasonable starting point. If you've never built any kind of automation before, go make three Zaps this week. Connect your calendar to Notion. Send yourself a Slack message when an RSS feed updates. Do something small and stupid. Feel how a &lt;em&gt;trigger&lt;/em&gt; leads to an &lt;em&gt;action&lt;/em&gt; which leads to a &lt;em&gt;result&lt;/em&gt;. Those three concepts are the spine of everything that comes next.&lt;/p&gt;

&lt;p&gt;The reason I didn't stop at Zapier is the difference between an automation and an agent. An automation is deterministic. Same input, same steps, same output. You define every branch in advance. It's predictable, which is why it's trustworthy for production work.&lt;/p&gt;

&lt;p&gt;An agent has wiggle room. You give it a goal and a set of tools, and it decides how to use them. Given the same input twice, it might do slightly different things. It might also do something you didn't anticipate, because the whole point is that it can improvise. Although that sounds risky (and sometimes it is), it's also the thing that makes an agent valuable. If the tool it expected is broken, it can find a workaround or build one. A classic automation just stops.&lt;/p&gt;

&lt;p&gt;Neither one is better. They solve different problems. And honestly, most production "agents" out there are closer to classic automations with a language model glued to the top. That's fine. It works. What matters is you know which one you're building, because the failure modes are completely different.&lt;/p&gt;
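&lt;p&gt;If the distinction is easier to see in code, here's a minimal sketch. Everything in it is illustrative (the tool name, the step cap, and pick_next_step as a stand-in for a language-model call); it only shows the shape of the two things: a fixed sequence versus a loop where the model chooses the next move.&lt;/p&gt;

```python
# Illustrative only: the tool names and step cap are made up, and
# pick_next_step stands in for a language-model call.

def automation(email_text):
    # Deterministic: same input, same steps, same output, every time.
    return "Summary request: " + email_text.strip()[:100]

def agent(goal, tools, pick_next_step):
    # The model (pick_next_step) chooses which tool to run next,
    # or says "done". The same input can take different paths.
    history = []
    for _ in range(10):  # hard cap so a confused agent can't loop forever
        step = pick_next_step(goal, history, list(tools))
        if step == "done":
            break
        history.append(tools[step]())
    return history
```

&lt;p&gt;Notice that the automation has no decision point at all, while the agent's entire body is a decision point. That's the whole difference, and it's why their failure modes diverge so completely.&lt;/p&gt;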

&lt;h2&gt;2. Three questions I had to answer the long way around&lt;/h2&gt;

&lt;p&gt;Before we touch any code, I want to borrow a framing from Zachary Wefel, who left one of the best comments on my original note. He pointed out that writers in tech tend to skip past the most basic things about how software actually exists in the world, because people around them already assume those things. He gave three questions as an example:&lt;/p&gt;

&lt;p&gt;Where does the agent live? How do you see it? How do you talk to it?&lt;/p&gt;

&lt;p&gt;I had to answer all three for myself, and I took the long way around on all of them. Here's what I learned.&lt;/p&gt;

&lt;h3&gt;Where does it live?&lt;/h3&gt;

&lt;p&gt;Mine lives on a Mac Mini next to the main TV in my living room. Before that it lived on my personal MacBook for the first few months, which was fine except I needed my laptop to be on all the time for anything to run. Eventually that got annoying enough that I &lt;a href="https://thoughts.jock.pl/p/mac-mini-ai-agent-migration-headless-2026" rel="noopener noreferrer"&gt;moved it to its own dedicated machine&lt;/a&gt;. That's not a day-one problem.&lt;/p&gt;

&lt;p&gt;For your first agent, the answer is: it lives on your laptop. That's it. Your laptop is enough. An agent is just software. It lives wherever that software runs. That can be your laptop, a cheap dedicated computer in your closet, a rented cloud server, or a Raspberry Pi. Don't complicate this before you have anything running.&lt;/p&gt;

&lt;h3&gt;How do you see it?&lt;/h3&gt;

&lt;p&gt;You mostly won't. There's usually no dashboard, no slick interface, no moving dials. This confuses a lot of beginners, because we're used to software having a face.&lt;/p&gt;

&lt;p&gt;You "see" an agent through what it produces. Files it writes. Messages it sends you. Things it prints in the terminal. Tasks it finishes or fails at. You can build a dashboard later if you want one (I eventually did), but on day one the agent is invisible except for its outputs.&lt;/p&gt;

&lt;h3&gt;How do you talk to it?&lt;/h3&gt;

&lt;p&gt;My agent has four channels now: email, Discord, iMessage, and a task app I built for it called WizBoard. That's way more than a beginner needs. You need &lt;em&gt;one&lt;/em&gt; channel, and whatever you already use for anything else is a fine pick.&lt;/p&gt;

&lt;p&gt;The easiest first channel is the terminal on your own laptop. You type a message. It responds. That's the whole interface. It looks ugly. It's also the most powerful setup you can have for learning, because every other interface is just a fancy wrapper around that same loop.&lt;/p&gt;

&lt;h2&gt;3. What you need to begin&lt;/h2&gt;

&lt;p&gt;Before any code, before any chat, here's the kit.&lt;/p&gt;

&lt;h3&gt;3.1. A machine&lt;/h3&gt;

&lt;p&gt;Your laptop is fine. Any laptop. Mac, Linux, Windows, all fine. If it can run a browser and a text editor, it can run your first agent. Don't buy anything new.&lt;/p&gt;

&lt;p&gt;Later on, if you want your agent to keep working while you sleep or while you're away from your desk, you'll eventually graduate to something that stays on. I wrote about &lt;a href="https://thoughts.jock.pl/p/mac-mini-ai-agent-migration-headless-2026" rel="noopener noreferrer"&gt;what that migration looked like for me&lt;/a&gt;, and it wasn't hard. Although it matters eventually, it's a month-three problem, not a day-one problem.&lt;/p&gt;

&lt;h3&gt;3.2. A subscription (or API access)&lt;/h3&gt;

&lt;p&gt;Let me be direct about this part, because I don't see it spelled out often enough in beginner guides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tiers aren't enough.&lt;/strong&gt; They cap you out fast, and you'll spend your first afternoon hitting rate limits instead of learning. This is the wrong place to save money.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A $20 per month tier is your floor.&lt;/strong&gt; Claude Pro, ChatGPT Plus, or the equivalent from whichever provider you pick. That tier is genuinely enough to build a simple first agent and get it working. You won't love it forever, but it's more than enough to start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Power users run more than that.&lt;/strong&gt; I pay for multiple subscriptions and for API usage on top. My bill isn't small. That's a months-from-now problem. Don't worry about it yet.&lt;/p&gt;

&lt;p&gt;Think of the $20 as a gym membership. It's the cost of learning the skill. And honestly, it's one of the cheapest upgrades to your toolkit you'll ever make, so don't flinch at it.&lt;/p&gt;

&lt;h3&gt;3.3. A harness (the tool you actually work with)&lt;/h3&gt;

&lt;p&gt;"Harness" is the word I use for the tool you sit in front of while building. There are four honest options, and all of them work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Claude Code.&lt;/strong&gt; A terminal-based tool from Anthropic. This is what I use most days. Deep file access, built for serious building. Power user territory, but approachable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Claude Cowork.&lt;/strong&gt; Also from Anthropic. A built-in cloud app that runs Claude in an agent loop without you ever touching a terminal. If the word "terminal" already makes you nervous, this is probably where you should start. It's genuinely good enough to build your first real agent in, and you can always graduate to Claude Code later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; (or the equivalent from another provider). Same category as Claude Code, different flavor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A plain AI chat&lt;/strong&gt; like Claude.ai or ChatGPT in your browser. Yes, you can genuinely start here. You'll be copy-pasting more, but it works completely.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pick one. Don't spend a week comparison-shopping. The differences don't matter until you've actually built something and know what you need. I wrote a longer piece on &lt;a href="https://thoughts.jock.pl/p/claude-code-source-leak-what-to-learn-ai-agents-2026" rel="noopener noreferrer"&gt;what's actually worth learning from a harness like Claude Code&lt;/a&gt; if you want a deeper take. But for today, pick one and move on.&lt;/p&gt;

&lt;h3&gt;3.4. A folder (this is THE architecture)&lt;/h3&gt;

&lt;p&gt;Here's the mental model that took me three months to see clearly. If you take it seriously, it'll save you those three months.&lt;/p&gt;

&lt;p&gt;The architecture of your AI agent IS its folder structure.&lt;/p&gt;

&lt;p&gt;That's it. There is no hidden magic layer. Every functional piece of an AI agent lives as a file in a folder on your computer. When someone online says "the agent has tools," what they really mean is: there are scripts in a folder that the agent knows how to run. When someone says "the agent has memory," they mean: there are markdown files it reads at the start of each session. When someone says "the agent has an instruction set," they mean: there's a file called something like CLAUDE.md or agents.md that tells it who it is and what the rules are.&lt;/p&gt;

&lt;p&gt;It's all files. That's the whole trick. Once you see the folder as the architecture, the mystery goes away.&lt;/p&gt;

&lt;p&gt;Here's what a beginner's agent folder looks like in practice:&lt;/p&gt;

&lt;p&gt;my-agent/&lt;br&gt;
├── CLAUDE.md              ← instructions (the brain)&lt;br&gt;
├── memory/&lt;br&gt;
│   └── notes.md           ← what the agent remembers&lt;br&gt;
├── projects/&lt;br&gt;
│   └── morning-email/&lt;br&gt;
│       ├── fetch-email    ← the part that pulls your email&lt;br&gt;
│       └── prompt.md      ← how you want it summarized&lt;br&gt;
├── scripts/               ← small helper scripts&lt;br&gt;
└── secrets/               ← API keys, passwords (keep this safe)&lt;/p&gt;

&lt;p&gt;Read that tree slowly. Every concept maps cleanly to a file or folder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instructions&lt;/strong&gt; live in CLAUDE.md or agents.md depending on your harness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt; lives in markdown files inside memory/.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt; (what the agent can do) are scripts inside scripts/ or inside each project folder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Projects&lt;/strong&gt; live as subfolders under projects/.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Credentials&lt;/strong&gt; (passwords, API keys) live in a protected secrets/ folder.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you look at an AI agent this way, it stops being a mysterious entity and starts being something very familiar: a folder with text files in it. I wrote about &lt;a href="https://thoughts.jock.pl/p/how-i-structure-claude-md-after-1000-sessions" rel="noopener noreferrer"&gt;how I structure the CLAUDE.md file itself after more than a thousand sessions&lt;/a&gt;, and that file is the single most important thing you will own. For now, just sit with the idea: the whole agent is a folder.&lt;/p&gt;
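&lt;p&gt;To make that concrete, here's what a first CLAUDE.md might look like. Every line of it is a placeholder to rewrite in your own words (step 3 of the build below has the AI draft one for you), but it shows how little is actually inside the "brain":&lt;/p&gt;

```markdown
# CLAUDE.md — illustrative skeleton; rewrite every line in your own words

## Who you are
You are my personal agent. You live in this folder and work only inside it.

## Rules
- Read memory/notes.md at the start of each session.
- Never touch anything in secrets/ except to read a credential a task needs.
- Ask me before sending anything on my behalf.

## Projects
Each subfolder of projects/ is one job. Its prompt.md says how I want it done.
```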

&lt;h2&gt;4. Build your first agent, step by step&lt;/h2&gt;

&lt;p&gt;Enough theory. I want you to finish this post with a real working agent, not just an understanding. I'm going to walk through the exact project I recommend for a first build: &lt;em&gt;an agent that reads your overnight email and writes you a one-paragraph morning summary.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I picked this one on purpose. It's small enough to finish in an afternoon. It's real enough that you'll actually use it tomorrow. And it'll make you hit most of the real challenges in building any agent: authentication, permissions, context, prompt design, error handling. You'll learn more from building this than from reading any number of articles about it.&lt;/p&gt;

&lt;h3&gt;Step 1. Decide what you want (fifteen minutes, no code)&lt;/h3&gt;

&lt;p&gt;Open your chat tool of choice. Not to write code yet. Just to think out loud. Describe your morning:&lt;/p&gt;

&lt;p&gt;Every morning I open my email. I scan 40 messages. I figure out which three actually matter. I want a one-paragraph summary of the important stuff before my coffee is done.&lt;/p&gt;

&lt;p&gt;That's your spec. Keep it this short. If you can't explain what you want in one honest paragraph, you don't understand what you want yet, and the agent isn't going to save you from that. Better to figure it out before you write a line of code.&lt;/p&gt;

&lt;h3&gt;Step 2. Create the folder (five minutes)&lt;/h3&gt;

&lt;p&gt;Make an empty folder on your computer. Call it my-agent. Inside it, create the skeleton:&lt;/p&gt;

&lt;p&gt;my-agent/&lt;br&gt;
├── CLAUDE.md&lt;br&gt;
├── memory/&lt;br&gt;
├── projects/morning-email/&lt;br&gt;
├── scripts/&lt;br&gt;
└── secrets/&lt;/p&gt;

&lt;p&gt;Empty folders are fine. We'll fill them as we go. The only reason to make them now is so your agent has a place to put things.&lt;/p&gt;
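&lt;p&gt;If you'd rather script it than click around in a file manager, a few lines do the same thing. This just mirrors the tree above, nothing more:&lt;/p&gt;

```python
# Creates the Step 2 skeleton. Folder names mirror the tree above.
from pathlib import Path

root = Path("my-agent")
for sub in ["memory", "projects/morning-email", "scripts", "secrets"]:
    (root / sub).mkdir(parents=True, exist_ok=True)

# Empty for now; Step 3 is where this file gets its first contents.
(root / "CLAUDE.md").touch()
```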

&lt;h3&gt;Step 3. Let the AI draft your instructions file (ten minutes)&lt;/h3&gt;

&lt;p&gt;If you're using Claude Code, there's an even shorter way to start. From inside your empty my-agent folder, run the /init command. Claude Code looks around, figures out what it's dealing with, and drops an initial CLAUDE.md in there for you. That's your starting point. One command, done.&lt;/p&gt;

&lt;p&gt;If you're in a different harness or a plain chat, type something like:&lt;/p&gt;

&lt;p&gt;I want to build an AI agent whose first job is to read my email inbox every morning and write me a one-paragraph summary of what matters. Draft a CLAUDE.md instructions file for it. Keep it under 50 lines. Don't assume anything about my setup.&lt;/p&gt;

&lt;p&gt;Either way, you'll end up with a file called CLAUDE.md inside your folder. That's the starting version. It will be rough. That's fine.&lt;/p&gt;

&lt;h3&gt;Step 4. READ the CLAUDE.md (this is the most important step in this entire post)&lt;/h3&gt;

&lt;p&gt;I'm not joking. This one step is worth more than the other seven combined.&lt;/p&gt;

&lt;p&gt;Open the file the AI just wrote. Read every line. Ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Does this actually describe what I want?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Are there weird assumptions baked in that I didn't ask for?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does the voice sound like me, or like corporate blog filler?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is there anything in here that surprises me?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Edit it until it reads like &lt;em&gt;you&lt;/em&gt; wrote it. Remove anything you don't understand. Add anything the model forgot. This file is the brain of your agent. If it's wrong, every single thing downstream of it will also be wrong, and you'll spend hours later chasing a ghost that started right here on day one. More on why in the mistakes section.&lt;/p&gt;

&lt;h3&gt;Step 5. Tell it what to automate (around thirty minutes)&lt;/h3&gt;

&lt;p&gt;Now the actual building. Here's the key thing to understand, and it's the reason I'm not writing out a bunch of code for you to copy: you don't have to. You can just describe what you want in plain language, and the harness will figure out the rest.&lt;/p&gt;

&lt;p&gt;Back to your harness. Say something like:&lt;/p&gt;

&lt;p&gt;I want the first thing in projects/morning-email to read my email inbox, pull the last 12 hours of unread messages, and hand them off to be summarized. The end result should be a one-paragraph summary of what actually matters. Figure out the best way to do this on my setup and walk me through it step by step.&lt;/p&gt;

&lt;p&gt;That's it. That's the entire prompt. No code, no jargon, no pretending you know what a shell script is.&lt;/p&gt;

&lt;p&gt;A good harness (and all of the options above qualify these days) will then ask you follow-up questions. What email provider do you use? Mac, Windows, or Linux? Do you already have API credentials? Do you want this to run on a schedule, or only when you ask for it? It'll figure out the right tool for the job and explain each step as it goes. You just answer the questions honestly.&lt;/p&gt;

&lt;p&gt;This is the real difference between working with an agent and writing code from scratch. You're not supposed to know in advance what tool or file format or library it's going to use. That's its job. Your job is to know what you want and to check the output when it lands.&lt;/p&gt;
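&lt;p&gt;For a feel of what the harness will produce for the boring part, here's the shape of the "last 12 hours of unread" filter as plain, non-AI code. The real fetch depends entirely on your provider (IMAP, a Gmail API, whatever the harness picks), so the message format here is a made-up stand-in:&lt;/p&gt;

```python
# Illustrative shape of the non-AI fetch step. The message dicts are a
# made-up stand-in for whatever your provider actually returns.
from datetime import datetime, timedelta, timezone

def recent_unread(messages, hours=12, now=None):
    """messages: dicts with 'received' (aware datetime) and 'unread' (bool)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    return [m for m in messages if m["unread"] and m["received"] >= cutoff]
```

&lt;p&gt;Nothing in there needs a language model, which is exactly the point of the next step.&lt;/p&gt;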

&lt;h3&gt;Step 6. Let it build, but put the AI call at the END of the pipeline&lt;/h3&gt;

&lt;p&gt;While your harness is building, there's one thing to steer. This might be the biggest efficiency lesson in the whole post: &lt;strong&gt;AI doesn't belong in every step of the pipeline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your agent is going to fetch email. Fetching email is a problem boring, non-AI code has solved for 30 years. You don't need a language model for that part. The only part that actually needs a language model is the summarizing, because that's the part that requires understanding the content.&lt;/p&gt;

&lt;p&gt;So tell the harness explicitly:&lt;/p&gt;

&lt;p&gt;Keep AI out of the fetch step. Use whatever normal tool is appropriate there. Only use the language model at the very end, for the summarization itself. One call total, not one per email.&lt;/p&gt;

&lt;p&gt;It'll handle this correctly if you ask for it. Usually it won't volunteer to do it this way, because stuffing an LLM into every step feels more impressive and uses more tokens. You'll thank yourself later. I wrote a whole piece on &lt;a href="https://thoughts.jock.pl/p/automation-guide-2025-ten-rules-when-to-automate" rel="noopener noreferrer"&gt;when to use AI and when to just use normal code&lt;/a&gt;, and the rule from that post applies directly here: use AI where judgment or language actually matters, and use plain tools for everything else.&lt;/p&gt;
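&lt;p&gt;The pipeline shape that instruction asks for looks like this in sketch form. The summarize argument is a placeholder for whatever single model call your harness wires up; everything before it is plain string work:&lt;/p&gt;

```python
# Sketch of "AI only at the end": plain code everywhere, then exactly
# one model call. summarize() is a placeholder for the real API call.

def build_prompt(emails):
    # Plain string work, no AI: pack every email into ONE prompt.
    body = "\n\n---\n\n".join(
        f"From: {e['from']}\nSubject: {e['subject']}\n{e['body']}"
        for e in emails
    )
    return "Summarize what actually matters, in one paragraph:\n\n" + body

def morning_summary(emails, summarize):
    if not emails:
        return "Nothing new overnight."  # no model call needed at all
    return summarize(build_prompt(emails))  # one call total, not one per email
```

&lt;p&gt;The design choice worth noticing: batching every email into a single prompt means your token cost stays roughly flat whether you got five messages or forty.&lt;/p&gt;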

&lt;h3&gt;Step 7. Run it (five minutes)&lt;/h3&gt;

&lt;p&gt;Now run the thing you just built. There are two honest ways to do this, depending on how comfortable you are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The non-technical way:&lt;/strong&gt; just ask your agent to run it for you. In Claude Code, Claude Cowork, or Codex, you can literally say "run my morning email agent" and it'll execute the thing it just built and show you the result. This is the easiest path if you're not comfortable in a terminal. It works. Use it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The technical way:&lt;/strong&gt; if you like knowing exactly what's happening, ask the harness "what command do I run to execute this myself?" and it'll give you the one-liner to paste into your terminal. Then you're running it directly, no agent in the loop.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Either way, you should see your morning summary print out. If you see it, you just built an AI agent. Congratulations. Go make coffee.&lt;/p&gt;

&lt;h3&gt;Step 8. When it breaks (this is where the real learning is)&lt;/h3&gt;

&lt;p&gt;It will break. Something won't authenticate, or the summary will be garbage, or it'll pull emails from the wrong time window. Good. This is the part you can't skip, and it's where the actual learning happens.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Read the error literally. Don't panic. Paste the whole thing back into your harness and ask it to explain what happened and what to try next.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the behavior keeps drifting from what you want, the problem is almost always in CLAUDE.md. Go back and fix the instructions there first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the summary is the wrong shape or tone, fix the summarization prompt.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If no data is coming through at all, the problem is earlier in the pipeline, and the agent can usually diagnose this for you in two or three back-and-forths.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. You have a real agent now. It's small, it's yours, and it does one thing you actually care about. Everything else in the rest of this post is about what will bite you as you grow it into something bigger.&lt;/p&gt;

&lt;h2&gt;5. The mistakes I made (so you can skip them)&lt;/h2&gt;

&lt;p&gt;This is the section my readers asked for the loudest. Opinion AI, who left the top comment on my original note, put it better than I could:&lt;/p&gt;

&lt;p&gt;Would love to see you cover the mistakes people make on their first agent build. The "what not to do" part is always more useful than the setup guide, and almost nobody writes about it.&lt;/p&gt;

&lt;p&gt;Agreed. Here are the ones I actually hit.&lt;/p&gt;

&lt;h3&gt;Mistake 1. Trusting the AI blindly to write your instructions file&lt;/h3&gt;

&lt;p&gt;Back in October, I was in a hurry. I let the AI generate my first CLAUDE.md and didn't read it carefully. I ran with it. Things worked, sort of. Then the agent started doing weird things I hadn't asked for. Small weirdness at first. Then bigger.&lt;/p&gt;

&lt;p&gt;I spent hours, maybe days, chasing ghosts. Poking at different parts of the architecture. Swapping tools. Adjusting prompts. Burning billions of tokens trying to figure out what was happening. The root cause turned out to be a single misguided sentence near the top of the instructions file that I hadn't bothered to read on day one.&lt;/p&gt;

&lt;p&gt;The rule is simple and I'll repeat it because it matters: &lt;strong&gt;you can use AI to generate your instructions. You can't skip reading them. Ever.&lt;/strong&gt; Read every line at least once. Edit until it sounds like you wrote it.&lt;/p&gt;

&lt;h3&gt;Mistake 2. Letting self-improvement run wild on the core files&lt;/h3&gt;

&lt;p&gt;Some time later, I built a self-improving layer. The agent could look at its own behavior, notice patterns, and update its own instructions. Technically brilliant. I was proud of it.&lt;/p&gt;

&lt;p&gt;I also forgot to tell it which files it was allowed to touch.&lt;/p&gt;

&lt;p&gt;Within a few days it had rewritten large parts of the core CLAUDE.md in ways I'd never sanctioned. The agent started drifting in five directions at once. Things I had explicitly told it to do were getting silently overwritten by its own "improvements." Although I was proud of the self-improvement layer as an idea, I had to roll a lot of it back and rebuild it from scratch.&lt;/p&gt;

&lt;p&gt;The fix was about scope. Each project in my agent now has its own small instruction file and its own little memory file. When self-improvement runs, it touches those leaf files, not the core. The trunk stays protected. The branches can grow. I eventually wrote a longer piece on &lt;a href="https://thoughts.jock.pl/p/wiz-ai-agent-self-improvement-architecture" rel="noopener noreferrer"&gt;the full self-improvement architecture&lt;/a&gt; if you want the deep version. For a beginner, the takeaway is simpler: never let any automated process write directly to the core instructions file. Ever.&lt;/p&gt;
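&lt;p&gt;The scoping rule can be enforced with something as small as an allowlist check before any automated write. This is a sketch of the idea, not my actual implementation; the paths follow the folder tree from earlier:&lt;/p&gt;

```python
# Sketch: gate every self-improvement write behind an allowlist.
# Paths follow the example folder tree; adjust to your own layout.
from pathlib import Path

PROTECTED = {Path("CLAUDE.md")}                      # the trunk: never auto-edited
WRITABLE_DIRS = [Path("projects"), Path("memory")]   # the leaves: fair game

def may_auto_edit(path):
    path = Path(path)
    if path in PROTECTED:
        return False
    # Allowed only if the file sits somewhere under a writable directory.
    return any(d in path.parents for d in WRITABLE_DIRS)
```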

&lt;h3&gt;Mistake 3. Ignoring open source out of pride&lt;/h3&gt;

&lt;p&gt;I wanted to build the whole thing myself. I refused to look at what other people were doing on GitHub. I told myself I didn't want to be influenced.&lt;/p&gt;

&lt;p&gt;That cost me two or three months.&lt;/p&gt;

&lt;p&gt;Around month three I finally caved and started reading other people's agent repos. Not to copy the architecture (which usually wouldn't fit anyway), but to steal &lt;em&gt;concepts&lt;/em&gt;. One example: I found a file called SOUL.md in an open source project. I'd only been using CLAUDE.md at that point, trying to cram every aspect of the agent into one file. SOUL.md turned out to be a dedicated place for personalization: values, voice, what the agent is &lt;em&gt;like&lt;/em&gt; as a personality. That small idea opened up a whole layer for me that I'd been clumsily stuffing into the main instructions. I was a better agent designer the day after I read it than I was the day before.&lt;/p&gt;

&lt;p&gt;Bianca Schulz asked about open source frameworks in the comments on my note, and here's the honest answer: read them, borrow concepts, don't feel obligated to adopt any single one of them wholesale. Your agent doesn't need to look like anyone else's. But you should know what the good ones are doing.&lt;/p&gt;

&lt;h3&gt;Mistake 4. Using the strongest model for every single task&lt;/h3&gt;

&lt;p&gt;For a long time I was running Opus on everything. Every small query. Every file read. Every trivial check. I'd hit my usage limit before lunch and then panic.&lt;/p&gt;

&lt;p&gt;The fix is something I now call model routing, and it cut my usage dramatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fast and simple stuff&lt;/strong&gt; goes to a small model, often a local LLM now. Before that I was using Haiku.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;General work, planning, most coding&lt;/strong&gt; goes to a mid-tier model. For me that's Sonnet 4.6. This is where most of the work happens.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hard reasoning, critical code, strategic decisions&lt;/strong&gt; go to Opus 4.6.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wrote in detail about &lt;a href="https://thoughts.jock.pl/p/claude-model-optimization-opus-haiku-ai-agent-costs-2026" rel="noopener noreferrer"&gt;why this switch made the agent both cheaper and better&lt;/a&gt;. Short version: nobody is going to optimize your usage for you. You have to do it yourself, and you should do it earlier than I did.&lt;/p&gt;
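&lt;p&gt;Mechanically, model routing can be as small as a lookup table. The tier names below mirror the list above, and the model names are placeholders for whatever your provider calls its small, mid, and top tiers. The genuinely hard part, classifying a task into a tier, is the rule set you build up over time:&lt;/p&gt;

```python
# Minimal routing sketch. Model names are placeholders, not real IDs;
# swap in your provider's actual small/mid/top models.
ROUTES = {
    "simple": "small-local-model",  # quick checks, formatting, triage
    "general": "mid-tier-model",    # planning, most coding
    "hard": "top-tier-model",       # critical reasoning, strategy
}

def pick_model(tier):
    # Unknown tiers fall back to the mid-tier workhorse, not the top.
    return ROUTES.get(tier, ROUTES["general"])
```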

&lt;h3&gt;Mistake 5. Trying to build Jarvis on day one&lt;/h3&gt;

&lt;p&gt;If I'm being completely honest, my original fantasy was Jarvis from Iron Man. One agent that solved everything, ran my whole life, handled the business, wrote the blog, managed the calendar, raised the kid. The whole thing. From day one.&lt;/p&gt;

&lt;p&gt;That was the real mistake, and basically everything else downstream of it was a consequence. I started with expectations that were impossible to meet in week one, so I kept pushing the architecture too hard and too fast. I'd add five features at once when I should've added one and let it settle. Although I did get a fully autonomous version working eventually, I had to roll a lot of it back.&lt;/p&gt;

&lt;p&gt;The version that actually works, the one I have now, is the one I should've been building from the start: incremental. One small task. Then the next. Then the next. The big Jarvis-like thing did emerge eventually, but as a side effect of building a hundred small working pieces, not as a top-down design.&lt;/p&gt;

&lt;p&gt;Full autonomy without taste isn't really what you want, either. The problem with a fully autonomous agent isn't that it can't do things. It's that it has no way of knowing whether the thing it just produced is actually good, because the thing that decides "good" is usually you. Your standards. Your instincts. Your sense of what's off.&lt;/p&gt;

&lt;p&gt;My agent is still autonomous for a large set of predictable tasks: morning reports, evening summaries, urgent flags, inbox triage, some experiments. Anything where the shape of "good" is well-defined. For anything creative, strategic, or quality-sensitive, I'm firmly in the loop.&lt;/p&gt;

&lt;p&gt;Think of an agent as a partner, not a solver. And don't try to build Jarvis on day one. Build one small, honest thing that works, then build the next one on top of it. That's the only order of operations that actually converges.&lt;/p&gt;

&lt;h3&gt;Mistake 6. Putting AI in every step of every pipeline&lt;/h3&gt;

&lt;p&gt;Early on, every single thing my agent did had a language model call somewhere in it. Fetching data. Moving files. Routing messages. Formatting output. LLM everywhere, because LLMs felt magical and I wanted to use them for everything.&lt;/p&gt;

&lt;p&gt;One morning I noticed I was at 50% of my 5-hour usage window before I'd actually done any real work. Just from the agent's own background tasks waking up.&lt;/p&gt;

&lt;p&gt;The fix was boring and obvious in hindsight: &lt;strong&gt;most of a pipeline can be a plain script.&lt;/strong&gt; Move data from A to B with a script. Call the model exactly once, at the end, for the one thing that actually requires language. That's what the model is for. Everything before that is plumbing, and plumbing should be code.&lt;/p&gt;

&lt;p&gt;AI isn't free. Even local models cost time, electricity, and capacity. You don't need AI everywhere. You need it where the language or the judgment actually matters.&lt;/p&gt;

&lt;h3&gt;Mistake 7. Forgetting that your harness updates constantly&lt;/h3&gt;

&lt;p&gt;Claude Code updates almost daily. Codex updates often. Every harness does. This is mostly a good thing, except for one small catch: features you built from scratch will sometimes get shipped natively by the tool you're building on, and now you have the same thing twice. Your custom version and the new native version start fighting each other, and the output drifts in ways that are hard to diagnose.&lt;/p&gt;

&lt;p&gt;My fix was a small automation that checks for updates every day and flags anything in my custom code that overlaps with new native features. When it finds one, I delete my version and use the native one. Cleaner, less code to maintain, better integration.&lt;/p&gt;

&lt;p&gt;If you don't do something like this, after a few weeks you'll notice things wiggling and conflicting and you won't know why. The harness moved under your feet. It's the cost of building on a fast-moving platform, and you just have to pay attention to it.&lt;/p&gt;
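&lt;p&gt;My actual automation is more involved, but the core idea fits in a few lines. This is a hypothetical sketch (the feature names and keywords are made up): keep keywords for each custom feature you've built, and flag any that show up in the harness's latest release notes.&lt;/p&gt;

```python
# Hypothetical custom features mapped to keywords that would signal
# a native replacement shipping in the harness.
CUSTOM_FEATURES = {
    "my-session-memory": ["memory", "persistent context"],
    "my-auto-commit": ["auto commit", "checkpoint"],
}

def flag_overlaps(release_notes):
    """Return the custom features that the latest release notes
    appear to duplicate, so you can delete your version."""
    notes = release_notes.lower()
    return sorted(
        name for name, keywords in CUSTOM_FEATURES.items()
        if any(kw in notes for kw in keywords)
    )

print(flag_overlaps("v2.3: native checkpointing with auto commit support"))
```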

&lt;h3&gt;
  
  
  Mistake 8. Installing skills from a marketplace without checking them
&lt;/h3&gt;

&lt;p&gt;This one is newer, because skill marketplaces and shareable agent extensions are newer. Claude Code now has a growing ecosystem of skills you can drop into your agent. Other harnesses have similar things. The idea is great: someone else already solved a problem you have, you install their skill, you save hours.&lt;/p&gt;

&lt;p&gt;The catch is that a skill is code that runs on your machine with your agent's permissions. If you install one without understanding what it does, you've effectively given a stranger a seat at the table inside your setup. Most skills are fine. Some aren't. I already wrote about &lt;a href="https://thoughts.jock.pl/p/claude-skill-auditor-security-scanner-claude-code-2026" rel="noopener noreferrer"&gt;a case where malware was hidden inside a Claude Code skill&lt;/a&gt;, which is why I built a scanner for them in the first place.&lt;/p&gt;

&lt;p&gt;The rule I follow now, and the one I'd give you from day one: before installing any skill from any marketplace, ask yourself two questions. &lt;strong&gt;Do I actually need this, or am I installing it because it's there?&lt;/strong&gt; And &lt;strong&gt;do I understand, at least roughly, what it's allowed to do?&lt;/strong&gt; If you can't answer both, don't install it yet. Read its source. Ask your agent to walk you through what it does. Treat it like any piece of software from someone you've never met, because that's what it is.&lt;/p&gt;
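&lt;p&gt;If you want something more concrete than "read its source," even a crude pattern scan helps. This is a toy heuristic (my own pattern list, nowhere near a real audit): flag any line in a skill's source that touches the network, the shell, or secrets, then read those lines first.&lt;/p&gt;

```python
import re

# Toy pattern list (mine, not any audit standard).
RISKY = {
    "network call": re.compile(r"requests\.|urllib|socket\."),
    "shell execution": re.compile(r"subprocess|os\.system"),
    "secret access": re.compile(r"os\.environ|api[_-]?key", re.IGNORECASE),
}

def audit_source(source):
    """Return (line number, label, line) for every risky-looking line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for label, pattern in RISKY.items():
            if pattern.search(line):
                findings.append((lineno, label, line.strip()))
    return findings

sample = "import subprocess\ntoken = os.environ['MY_KEY']\n"
for lineno, label, line in audit_source(sample):
    print(f"line {lineno}: {label} -- {line}")
```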

&lt;h3&gt;
  
  
  Mistake 9. Not using Git from day one (the mistake I'm glad I didn't make)
&lt;/h3&gt;

&lt;p&gt;I want to be honest here: this one isn't actually my mistake. I started using Git from the very beginning on every agent project I've ever built, and that single habit has saved me more times than I can count. I'm including it because the number of beginners I've watched skip it and then lose weeks of work is too high to leave out.&lt;/p&gt;

&lt;p&gt;Git is the thing that lets you roll back to a working version when something goes wrong. And something will go wrong. Your agent will make a change to a file you didn't expect. You'll delete the wrong folder. You'll let the model rewrite something that was working and discover two days later that the new version is worse. Without Git, you're stuck trying to remember what the file looked like three days ago. With Git, you type one command and you're back.&lt;/p&gt;

&lt;p&gt;The good news is this is now genuinely easy, even for non-technical people. You can ask your harness to set up a Git repository for you and it'll do the whole thing. Private repo on GitHub is free and fine. You can even set up an automation so that every time your agent finishes a meaningful task, it commits and pushes the current state to the repo automatically, which means you basically never lose work. I set mine up like that and I haven't thought about it since.&lt;/p&gt;
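&lt;p&gt;The commit-after-every-task habit is small enough to sketch directly. This is my own minimal version, not a built-in feature of any harness: stage everything, commit only if something actually changed, push.&lt;/p&gt;

```python
import subprocess

def checkpoint(repo_dir, task_name):
    """Commit and push the current state after a finished task."""
    def git(*args):
        return subprocess.run(["git", "-C", repo_dir, *args],
                              capture_output=True, text=True)
    git("add", "-A")
    # 'git diff --cached --quiet' exits nonzero only when something is staged.
    if git("diff", "--cached", "--quiet").returncode != 0:
        git("commit", "-m", f"checkpoint: {task_name}")
        git("push")  # fails harmlessly if no remote is configured yet
```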

&lt;p&gt;If you remember nothing else from this section, remember this: &lt;strong&gt;commit and push every working version of your agent, from the very first day.&lt;/strong&gt; It's the cheapest insurance policy in the whole setup, and every single person who has ever lost work to a runaway edit wishes they'd done it sooner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus mistake. Thinking you need to build alone
&lt;/h3&gt;

&lt;p&gt;I'll say this honestly because I lived it: building an agent in isolation is much slower than building one while reading what other people are running into. Communities, newsletters, GitHub discussions, random Substack notes at midnight. The people doing this work are almost all willing to share what they're learning. Go find them. I learned some of the most important things I know from comments on my own posts, which is the only reason this post exists at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Context window is the whole game
&lt;/h2&gt;

&lt;p&gt;hohoda in the comments on my original note nailed something I think about constantly:&lt;/p&gt;

&lt;p&gt;The context window is the real constraint. Everything else (tools, models, memory) is downstream of how well you manage what the agent sees at any given moment.&lt;/p&gt;

&lt;p&gt;200,000 tokens sounds like a lot. It isn't, once you understand what fills it.&lt;/p&gt;

&lt;p&gt;Every session auto-loads a bunch of stuff before you've even typed anything: your core instructions file, your memory files, the conversation history if there is any, the current task state. That's your "always-on" overhead. For me, that adds up fast. It's a cost I didn't fully understand at first, because it happens before you see a single response.&lt;/p&gt;
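&lt;p&gt;You can get a feel for that overhead with a rough estimate. This sketch assumes my own layout (a CLAUDE.md plus a memory/ folder) and the crude rule of thumb of roughly four characters per token; it's not an official count from any harness.&lt;/p&gt;

```python
from pathlib import Path

def startup_overhead(workspace):
    """Rough token estimate (about 4 characters per token) for
    everything a session auto-loads before you type anything."""
    ws = Path(workspace)
    always_loaded = [ws / "CLAUDE.md", *sorted((ws / "memory").glob("*.md"))]
    return sum(len(p.read_text()) // 4 for p in always_loaded if p.exists())
```

&lt;p&gt;By that math, a 2,000-character CLAUDE.md alone costs roughly 500 tokens off the top of every single session, before the conversation even starts.&lt;/p&gt;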

&lt;p&gt;For a beginner, three rules carry you a long way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep your CLAUDE.md thin.&lt;/strong&gt; Every line you add is a line the model has to read at the start of every single session. Treat it like precious real estate. If you can say it shorter, say it shorter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One memory file per project, and that's it.&lt;/strong&gt; Don't build a vector database. Don't install a semantic search engine. Don't set up a temporal knowledge graph. Not on day one. A flat markdown file per project is enough for a surprisingly long time. That's how I started and it worked for months.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't worry about compaction yet.&lt;/strong&gt; Eventually, once your memory files get large, you might want a process that rewrites them to stay under a size threshold. I run one every night now. That's a month-three problem, not a day-one problem.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
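&lt;p&gt;When you do reach the compaction stage, the naive version is tiny. This sketch just keeps the newest half of a memory file once it crosses a size budget; my real nightly pass asks the model to summarize instead of truncating, but the trigger logic is the same idea.&lt;/p&gt;

```python
def compact_memory(text, max_chars=8000, keep_recent=0.5):
    """Naive compaction sketch: once a memory file outgrows its
    budget, keep only the newest lines behind a marker."""
    if len(text) > max_chars:
        lines = text.splitlines()
        cut = int(len(lines) * (1 - keep_recent))
        return "\n".join(["[older entries compacted]", *lines[cut:]])
    return text
```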

&lt;p&gt;For almost any beginner project, 200k tokens is more than enough. A back-and-forth conversation over iMessage barely touches the budget. The failure mode is almost never "model context too small." It's "my CLAUDE.md bloated to 800 lines and now every session starts with a giant anchor around its neck."&lt;/p&gt;

&lt;p&gt;I wrote a longer piece on &lt;a href="https://thoughts.jock.pl/p/how-i-structure-claude-md-after-1000-sessions" rel="noopener noreferrer"&gt;how I keep my own CLAUDE.md structured after a thousand plus sessions&lt;/a&gt; if you want to see the mature version. For now, just remember: thin instructions, one memory file per project, and context is the first thing that'll bite you when the agent starts behaving strangely.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Security from day one
&lt;/h2&gt;

&lt;p&gt;Bianca Schulz asked about security on my note, and this is the section I think about the most when I write pieces like this. It was one of the biggest reasons I built my own agent instead of using an off-the-shelf one.&lt;/p&gt;

&lt;p&gt;Here's the thing: an AI agent is a new attack surface on your computer. It has permissions. It runs code. It reads your files. It talks to the internet. And because we're still early in how this all works, the models that drive it can be tricked, manipulated, or prompt-injected in ways we don't fully understand yet. You're adding a new thing with a lot of power to your machine, and you should act like that.&lt;/p&gt;

&lt;p&gt;My progression was deliberate, and I'd recommend something similar for you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MacBook phase.&lt;/strong&gt; Very restricted permissions. Only the folders I explicitly whitelisted. No blanket network access. No access to real credentials. I built slowly and paid attention to what the agent actually needed. My personal machine has my personal things on it, and I wasn't about to let a half-built agent loose in there.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Learning phase.&lt;/strong&gt; As I understood what the agent actually needed and could trust it with, I expanded its permissions carefully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dedicated machine phase.&lt;/strong&gt; Eventually I moved it to its own Mac Mini. An isolated computer, dedicated to the agent, with its own accounts and its own credentials. That machine is where the agent has broad permissions. My personal laptop doesn't, and never will again.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
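&lt;p&gt;The whitelisting idea from the MacBook phase can be expressed as a deny-by-default path check. The folder paths below are made-up examples; the point is that anything outside the list, including ../ traversal tricks, gets refused.&lt;/p&gt;

```python
from pathlib import Path

# Example whitelist; the paths are invented, the deny-by-default stance is the point.
ALLOWED = [Path("/Users/me/agent/projects"), Path("/Users/me/agent/memory")]

def is_allowed(path, allowed=ALLOWED):
    """Resolve the path first so '../' tricks can't escape the whitelist."""
    target = Path(path).resolve()
    return any(target.is_relative_to(root) for root in allowed)
```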

&lt;p&gt;A rule I learned the hard way and will give you for free: &lt;strong&gt;the agent should have its own accounts, not yours.&lt;/strong&gt; Its own email address. Its own API keys. Its own logins. Don't share your personal credentials with it. When something goes wrong, and something will eventually go wrong, you want the blast radius to be contained.&lt;/p&gt;

&lt;p&gt;Two months ago I launched a small tool called &lt;a href="https://thoughts.jock.pl/p/claude-skill-auditor-security-scanner-claude-code-2026" rel="noopener noreferrer"&gt;a security scanner for Claude Code skills&lt;/a&gt;, which hit the front page of Hacker News. I built it because I was reading stories about autonomous agents being exploited in the wild and realized I wanted a way to check my own setup against a list of known issues. If you're running anything serious, something like this is worth having in your toolbox. And even if you're not, just paying attention to permissions from day one will put you ahead of almost everyone else building in this space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing. Start small, start today
&lt;/h2&gt;

&lt;p&gt;You don't need the strongest model. You don't need a fancy framework. You don't need a PhD in machine learning or expensive hardware or a cloud account.&lt;/p&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A laptop you already own.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A $20 per month subscription to a real model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A harness. Any harness. Pick one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A folder on your computer, with CLAUDE.md, a memory/ subfolder, a projects/ subfolder, and a secrets/ subfolder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One real project you actually want to exist. Not a demo. Something you'd use tomorrow morning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
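&lt;p&gt;That folder layout is a few lines to create, if you'd rather script it than click around. The starter line in CLAUDE.md is just a placeholder; write your own.&lt;/p&gt;

```python
from pathlib import Path

def scaffold(root="agent-workspace"):
    """Create the starter layout from the list above."""
    base = Path(root)
    for sub in ("memory", "projects", "secrets"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    claude = base / "CLAUDE.md"
    if not claude.exists():
        claude.write_text("# Who this agent is and what it is allowed to do\n")
    return base
```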

&lt;p&gt;Start with that. The rest (all the architecture and the self-improvement and the model routing and the memory compaction) comes as you grow into it. None of it needs to exist on day one.&lt;/p&gt;

&lt;p&gt;Everything will break regularly. Your harness will update under your feet. Your instructions file will drift. Your context window will bloat. The model will hallucinate a function that doesn't exist and confidently insist it does. Although it cost me a lot of time at the beginning, I really don't mind it anymore. It's the job right now, and I accept that. &lt;a href="https://thoughts.jock.pl/p/my-ai-agent-works-night-shifts-builds" rel="noopener noreferrer"&gt;I wrote my first piece about Wiz back when it was just a night-shift experiment&lt;/a&gt;, and looking back, almost everything I thought I knew then was wrong. That's fine. The only thing that compounds is the habit of building, breaking things, fixing them, and writing down what you learned.&lt;/p&gt;

&lt;p&gt;The people in my comments who asked for this post already know more than most. Almost all of you have the instinct, and most of you have the tools. What's left is the part I can't do for you: opening the folder, writing the first line of CLAUDE.md, and running something small tonight that didn't exist this morning.&lt;/p&gt;

&lt;p&gt;Go build your first agent. Then tell me what broke.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write about building Wiz, my AI agent, roughly twice a week on Digital Thoughts. Every mistake, every rebuild, every thing that surprised me along the way. If this post was useful, subscribe and you'll get the next one as soon as it goes out.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://thoughts.jock.pl/p/how-to-build-your-first-ai-agent-beginners-guide-2026" rel="noopener noreferrer"&gt;Digital Thoughts on Substack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>agents</category>
      <category>automation</category>
    </item>
    <item>
      <title>Claude Code vs Codex CLI vs Aider vs OpenCode vs Pi vs Cursor: Which AI Coding Harness Actually Works Without You?</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Thu, 16 Apr 2026 11:09:35 +0000</pubDate>
      <link>https://dev.to/joozio/claude-code-vs-codex-cli-vs-aider-vs-opencode-vs-pi-vs-cursor-which-ai-coding-harness-actually-79l</link>
      <guid>https://dev.to/joozio/claude-code-vs-codex-cli-vs-aider-vs-opencode-vs-pi-vs-cursor-which-ai-coding-harness-actually-79l</guid>
      <description>&lt;h1&gt;
  
  
  Claude Code vs Codex CLI vs Aider vs OpenCode vs Pi vs Cursor: Which AI Coding Harness Actually Works Without You?
&lt;/h1&gt;

&lt;p&gt;My AI agent &lt;a href="https://thoughts.jock.pl/p/building-ai-agent-night-shifts-ep1" rel="noopener noreferrer"&gt;wakes up at 2am, picks tasks from a queue, ships code, and sends me a report by morning&lt;/a&gt;. For that to work, I need a coding harness I can trust when I'm not watching.&lt;/p&gt;

&lt;p&gt;Not a tool that helps me code faster. A tool that codes when I'm asleep.&lt;/p&gt;

&lt;p&gt;That's a different question than "which IDE is best." IDEs are for humans who are present. Harnesses are for when you're not. It's also not the same question as "which has the best autocomplete." That's a different category entirely, one we're not touching here.&lt;/p&gt;

&lt;p&gt;I've used Claude Code daily for months, run Codex CLI and OpenCode in parallel, tested Pi, and dug into the open-source alternatives. This is what I actually think.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Harness Actually Is
&lt;/h2&gt;

&lt;p&gt;A harness connects the horse to the cart. In AI coding, it's the set of tools and environment in which the agent operates.&lt;/p&gt;

&lt;p&gt;Here's the thing most people miss: LLMs can only generate text. That's it. They can't read your files, run commands, or edit code directly. What a harness does is give the model structured tool calls it can emit as text. The harness intercepts those, executes them with real code, appends the output to the conversation history, and prompts the model to continue. Every tool call follows the same loop: model pauses, harness runs something, result added to context, model restarts. At its core this is about 60-75 lines of Python. The complexity is entirely in the tuning: what tools the model gets, how those tools are described, and what the system prompt says.&lt;/p&gt;
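&lt;p&gt;To make that loop concrete, here's a stripped-down sketch. The "TOOL: name arg" convention and the scripted fake model are mine, not any real harness's wire format, but the shape of the loop is exactly the one described above.&lt;/p&gt;

```python
def run_agent(task, model, tools, max_steps=10):
    """Minimal harness loop: the model only ever emits text; the
    harness executes tool calls and feeds the output back in."""
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        reply = model("\n".join(history))       # model emits text, nothing more
        history.append(reply)
        if reply.startswith("TOOL:"):
            name, _, arg = reply[5:].strip().partition(" ")
            result = tools.get(name, lambda a: "unknown tool")(arg)
            history.append(f"RESULT: {result}") # output goes back into context
        else:
            return reply                        # no tool call means we're done
    return history[-1]

# A scripted fake model that asks for one tool, then finishes:
tools = {"read_file": lambda path: "contents of " + path}
script = iter(["TOOL: read_file notes.txt", "DONE: summarized notes.txt"])
print(run_agent("summarize notes", lambda context: next(script), tools))
```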

&lt;p&gt;This matters because the tuning is where harnesses actually diverge. Two harnesses running the same model on the same task can produce dramatically different results. Not because of the model, but because of what the harness tells the model it can do and how to use it.&lt;/p&gt;

&lt;p&gt;Tab autocomplete isn't a harness. It's a suggestion box. A nice UI on top of an existing harness (like T3 Code, which wraps Claude Code and Codex CLI) is also not a harness. The real question for every tool below: can it take a task, execute it end-to-end across multiple files, handle errors, and report back without me in the loop?&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Different Categories: Coding Tools vs Agent Orchestrators
&lt;/h2&gt;

&lt;p&gt;Before comparing specific tools, it's worth naming the split that most comparisons ignore. Not all "AI coding harnesses" are trying to do the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coding tools&lt;/strong&gt; are pair programmers. You direct each step. They execute that step very well, commit the result, and wait for the next instruction. Aider is the clearest example; Codex CLI leans this way too, and so does Cline. These are tools built around the assumption that you're at the keyboard and providing direction. They make individual tasks faster and better. They're not designed to chain 40 decisions together autonomously while you sleep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent orchestrators&lt;/strong&gt; are designed to take a goal and execute autonomously across multiple steps, files, and decision points. Claude Code is built for this. Devin is the extreme version. Pi, if you build out the harness fully, fits here. These tools are designed around the assumption that you're not watching, and they need to make judgment calls without asking.&lt;/p&gt;

&lt;p&gt;Most comparisons treat all of these as the same thing and rank them on the same axis. That produces misleading results. Aider isn't trying to replace Claude Code for overnight autonomous runs. Codex CLI isn't trying to be an agent orchestrator in the same sense Claude Code is. Judging them by the same criteria produces noise.&lt;/p&gt;

&lt;p&gt;The honest answer to "which is best" depends entirely on which category you need. This post tries to be clear about which tools belong where, and let you make the call for your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark Reality (And Why It Doesn't Tell the Full Story)
&lt;/h2&gt;

&lt;p&gt;SWE-bench Verified became the standard benchmark for this category. It measures how often a coding agent independently resolves real GitHub issues from start to finish. That status also made it a target. Researchers flagged contamination: training data for newer models overlaps with the test set, which inflates scores. The cleaner alternative is &lt;strong&gt;SWE-bench Pro&lt;/strong&gt;, introduced in 2026, with 2,000+ problems that weren't in any public training data. GPT-5.4-Codex leads there at 56.8%. Harder problems, more honest scores.&lt;/p&gt;

&lt;p&gt;Terminal-Bench 2.0 deserves a separate mention because it's more relevant for agentic tasks than SWE-bench. It tests autonomous, multi-step execution in real terminal environments. Not just code edits. Actual shell navigation, file management, running commands in sequence, recovering from errors. The Claude Code harness configuration benchmarked here ("Claude Mythos") hits 92.1%. Codex CLI hits 77.3%. That 15-point gap is a better signal for overnight autonomous work than SWE-bench numbers.&lt;/p&gt;

&lt;p&gt;Now the result that breaks the "pick the highest number" logic. Matt Mayer ran an independent test comparing the same model inside different harnesses. Claude Opus: 77% in Claude Code, 93% in Cursor. Same model. Same tasks. 16 percentage points from the harness alone. That's not an outlier. CORE-Bench found Claude Opus at 42% with a minimal scaffold, rising to 78% inside Claude Code's full harness. Across multiple independent studies the harness effect ranges from 5 to 40 percentage points depending on model and task type.&lt;/p&gt;

&lt;p&gt;A few flags before reading the tool sections. Cursor doesn't publish SWE-bench Verified results and uses its own proprietary CursorBench at 61.3% instead. Draw your own conclusions. OpenCode and Pi have no published scores because their performance is entirely model-dependent. Devin's frequently cited 13.86% figure is from 2023 and belongs in a museum. It does not appear in the current top 30 of any major leaderboard.&lt;/p&gt;

&lt;p&gt;What the scores actually tell you: harness quality matters as much as the model you put in it. Cursor employs people whose full-time job is to rewrite system prompts and tool descriptions every time a new model ships. Claude will keep using a tool you label "deprecated." Gemini will abandon structured tools entirely and only use bash. Cursor tests obsessively and adjusts. Most harnesses don't. Keep this in mind across every section below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code: The Deep Harness
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: Agent orchestrator | &lt;a href="https://code.claude.com" rel="noopener noreferrer"&gt;code.claude.com&lt;/a&gt; | &lt;a href="https://github.com/anthropics/claude-code" rel="noopener noreferrer"&gt;GitHub (114k stars)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Full disclosure: this is what I use daily, and what runs &lt;a href="https://thoughts.jock.pl/p/ai-agent-self-extending-self-fixing-wiz-rebuild-technical-deep-dive-2026" rel="noopener noreferrer"&gt;Wiz&lt;/a&gt; on a headless Mac Mini overnight. I try to be honest about it.&lt;/p&gt;

&lt;p&gt;Claude Code is the most complete agentic runtime available right now. It reads CLAUDE.md, a project-specific instruction file that persists across every session. You can describe your entire architecture, your preferences, your forbidden patterns, and the agent carries that into every run without you repeating it. It has Agent Teams for spinning up parallel sub-agents that coordinate on a shared goal. As of March 2026, computer use means it can point and click through UIs, take screenshots, and handle workflows that resist scripting.&lt;/p&gt;

&lt;p&gt;The thing &lt;a href="https://thoughts.jock.pl/p/the-compounding-agent-ep4" rel="noopener noreferrer"&gt;I keep noticing with Claude Code&lt;/a&gt; is that it genuinely builds on context over time. A session that starts with "add authentication" will remember the decisions it made about your auth architecture when it gets to "add rate limiting" three steps later. That coherence across a long task chain is what makes it feel like an agent rather than a very fast typist.&lt;/p&gt;

&lt;p&gt;One important thing about how any harness uses context: the model only knows what's in its conversation history. When Claude Code opens your project, it doesn't already know your codebase. It explores via tool calls, building context incrementally. CLAUDE.md front-loads that context so fewer tool calls are wasted on discovery. Dumping your entire codebase into context (the old Repomix approach) is the wrong answer. Past around 50-100k tokens, model accuracy drops significantly. More context makes models dumber past a threshold. Good harnesses build context as needed, not all at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it struggles:&lt;/strong&gt; context loss on sessions longer than 2 hours, where it starts forgetting early decisions. Terminal-only interface has a real learning curve. Token consumption is 3-4x higher than Codex CLI per equivalent task, which compounds on long autonomous sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; complex multi-file tasks, overnight autonomous runs, architecture-level changes that require consistent context across many steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Claude Pro ($20/mo) or Max ($100+/mo). For regular autonomous sessions, Max is almost certainly necessary. The per-token costs on long runs add up fast. For a detailed Claude Code vs Codex head-to-head from two months of real usage, &lt;a href="https://thoughts.jock.pl/p/claude-code-vs-codex-real-comparison-2026" rel="noopener noreferrer"&gt;I covered that comparison separately&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Codex CLI: Good, But Not What the Hype Says
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: Coding tool, emerging agent | &lt;a href="https://openai.com/codex/" rel="noopener noreferrer"&gt;openai.com/codex&lt;/a&gt; | &lt;a href="https://github.com/openai/codex" rel="noopener noreferrer"&gt;GitHub (67k stars)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Codex CLI is not the old Codex model from 2021. It's OpenAI's terminal-based agent, open-source on GitHub, bundled with ChatGPT Plus or Pro, running on GPT-5.4. The benchmark puts it at 77.3% on SWE-bench, close to Claude Code's 80.8%, and at 3-4x lower token cost. On paper, a strong contender.&lt;/p&gt;

&lt;p&gt;In practice, my honest read: it's cold. That's the right word. What I mean is that Codex CLI feels raw as an agent. It executes individual steps cleanly, but it doesn't feel like it's building toward something the way Claude Code does. Give it a multi-step task: add this feature, connect it to this other component, update the tests. It handles step one well, sometimes step two, and starts losing coherence by step three or four. It restates what it did, asks for clarification it shouldn't need, or misses a dependency it should have caught from context it already has. That gap between 77.3% and 80.8% is exactly this: Claude Code holds context through longer chains.&lt;/p&gt;

&lt;p&gt;Where Codex CLI genuinely shines is raw coding quality on focused tasks. iOS apps, macOS apps, web apps. Give it a specific, contained task and GPT-5.4 is excellent. The code quality on front-end work, app scaffolding, and UI logic is strong. I'd put it on par with or ahead of Claude Sonnet for this category of work. It's not the harness that's the advantage there. It's GPT-5.4 being particularly strong at app development.&lt;/p&gt;

&lt;p&gt;The architectural difference worth knowing: Codex CLI runs in cloud containers managed by OpenAI, not on your local machine. You can fire off a task and disconnect. The task keeps running without your terminal staying open. For batch work and overnight jobs where you're not monitoring, that's genuinely useful. For tight local loops where your environment variables and local state matter, you're working around the sandboxing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it struggles:&lt;/strong&gt; multi-step agentic chains with dependencies. Feels unfinished as a full harness compared to Claude Code. Less context coherence on complex tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; focused coding tasks (especially apps), token-efficient runs, developers already on ChatGPT Plus who want to try a CLI agent without extra cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; included with ChatGPT Plus ($20/mo) or Pro ($200/mo). If you're already paying for ChatGPT, this is essentially free to try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aider: The Underrated Open-Source Standard
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: Coding tool (pair programmer) | &lt;a href="https://aider.chat" rel="noopener noreferrer"&gt;aider.chat&lt;/a&gt; | &lt;a href="https://github.com/Aider-AI/aider" rel="noopener noreferrer"&gt;GitHub (43k stars)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Aider is the tool most people in the "AI coding" conversation have never used, even though it has 43,000 GitHub stars and processes 15 billion tokens per week in production. That's not a toy project.&lt;/p&gt;

&lt;p&gt;The model is fundamentally different from Claude Code or Codex. Aider is a git-first pair programmer, not an autonomous orchestrator. You bring your own model (Claude Sonnet, GPT-5, Gemini 2.5, DeepSeek, Qwen, local Ollama) and Aider wraps it with git-native execution. Every AI edit becomes a commit. The repo map gives it structural understanding of your whole codebase before it touches anything. It auto-lints and runs tests after every change, self-fixing detected issues before reporting back.&lt;/p&gt;

&lt;p&gt;The token efficiency is striking: 4.2x fewer tokens than Claude Code per equivalent task. If you're paying for API access directly, Aider with Claude Sonnet is the most cost-efficient path to serious coding automation by a wide margin.&lt;/p&gt;

&lt;p&gt;The honest tradeoff: Aider doesn't orchestrate across 40 files and coordinate sub-agents. It executes a task, executes it well, and commits the result. It's more like having a disciplined pair programmer who never skips a commit than a system that independently plans and executes a multi-hour architecture session. For incremental work, refactoring a module, implementing a feature, fixing a class of bugs, it's the right tool. For overnight autonomous sessions that need to make judgment calls across large contexts: Claude Code.&lt;/p&gt;

&lt;p&gt;The git-first philosophy deserves separate mention. Every change is committed. Your entire interaction with the agent is auditable, reversible, and reviewable inside your normal git workflow. No other tool in this list bakes that in at the same level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; focused incremental work, budget setups, teams that want full audit trails, developers who want BYOM flexibility without giving up discipline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; free. You pay your model provider directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenCode: The Provider Switcher
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: Hybrid (coding + emerging agent) | &lt;a href="https://opencode.ai" rel="noopener noreferrer"&gt;opencode.ai&lt;/a&gt; | &lt;a href="https://github.com/opencode-ai/opencode" rel="noopener noreferrer"&gt;GitHub (72k stars)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OpenCode's value proposition is breadth: 75+ LLM providers, all accessible from the same interface. Anthropic, OpenAI, Google, DeepSeek, AWS Bedrock, Azure, local Ollama, and more. I've used it with Claude Opus, GPT models, and open-weight models like Qwen and GLM. The switching experience is genuinely seamless in a way that nothing else matches. One command, different provider, same workflow. You can't do that in Claude Code or Codex.&lt;/p&gt;

&lt;p&gt;But I'll be honest about something: there's something missing from the experience. It's hard to name exactly. After using it alongside Claude Code for a while, I notice OpenCode doesn't feel like it's building a working relationship with your project. There's no CLAUDE.md equivalent that persists project context. There's no Agent Teams layer for coordinating parallel work. The autonomous behavior is functional but less mature. It handles individual tasks well, but it doesn't feel like a system designed for extended unattended operation.&lt;/p&gt;

&lt;p&gt;With open-weight models like Qwen and GLM, it's fine. Gets the job done for straightforward tasks. You're not going to get Claude Opus-level reasoning, but for routine edits and quick fixes, the cost savings are real.&lt;/p&gt;

&lt;p&gt;The provider switching is genuinely the killer feature. If you're doing model experiments, comparing how GPT-5.4 handles a task vs Claude Sonnet vs a local Qwen, OpenCode is the tool for that. If you already have subscriptions to multiple providers and want to use them without managing separate CLI tools, OpenCode is the right architecture. But for a long-term primary agent setup where you need consistent, deep project context: I'd reach for something else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; model experimentation, teams with multiple provider subscriptions, privacy-first setups with local Ollama, cost arbitrage across providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; free. BYOM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pi: The One I Actually Want to Use More
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: Coding tool + primitives harness | &lt;a href="https://pi.dev" rel="noopener noreferrer"&gt;pi.dev&lt;/a&gt; | &lt;a href="https://github.com/badlogic/pi-mono" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Pi is genuinely different from everything else here, and I want to say this upfront: I like it. It's fast, it's flexible, and the experience is clean in a way proprietary tools often aren't. If I could choose without constraints, Pi is probably the closest thing to what I'd want as a daily harness alternative to Claude Code.&lt;/p&gt;

&lt;p&gt;The design philosophy is the opposite of the "more features" trend. Its tagline is blunt: "there are many coding agents, but this one is mine." Instead of an opinionated harness, it gives you primitives. A minimal core you configure yourself. Terminal TUI, 15+ LLM providers, tree-structured session history you can navigate and export, and four operation modes. The interesting one for builders: RPC mode. Pi runs as an embeddable subprocess inside a larger automation system. Your orchestration layer calls Pi, it executes the coding task, returns structured output. Designed to be a component in a system, not a standalone tool.&lt;/p&gt;

&lt;p&gt;What's deliberately absent: sub-agents, plan mode, permission popups, background processes. Pi's bet is that most harnesses embed too many assumptions about your workflow. Strip to primitives, ship extensions via npm, build exactly what you need. AGENTS.md and SYSTEM.md play the same role CLAUDE.md does in Claude Code.&lt;/p&gt;

&lt;p&gt;So why am I not using it more? One reason, and it's a real one: &lt;strong&gt;Anthropic's billing doesn't let you bring your Max subscription to third-party harnesses.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pi is BYOM, bring your own API key. When I tested it with Claude, Pi surfaced a message explicitly: usage through Pi counts against API billing, not your Claude subscription. So if you're on Claude Max ($100+/mo), using Pi with Claude means paying twice. The Max subscription for Claude Code, and API rates on top for Pi. Those costs add up fast on any serious coding session. I was paying from my own pocket to test something I wanted to use more. That's not a good feeling.&lt;/p&gt;

&lt;p&gt;This isn't Pi's fault. It's Anthropic's policy. They don't allow third-party harnesses to draw on subscription credits. You have to use Claude Code to get what you're paying for on the subscription. Google does the same with Gemini. Theo from T3 made this point in a recent video on harnesses: if you're paying $200/month for Opus, you have to use their harness. OpenAI, by contrast, lets your API credits work across third-party tools freely.&lt;/p&gt;

&lt;p&gt;In a world where Anthropic changed this, where your Max subscription applied to any MCP-compatible harness, Pi would probably be what I'd reach for first. The speed, the flexibility, the primitives-first design: it fits the kind of automation system I'm building. But until that policy changes, the economics don't work for anyone on a Claude subscription. You pay for Claude twice if you want to experiment with a different harness.&lt;/p&gt;

&lt;p&gt;If you're on GPT or open-weight models (Qwen, DeepSeek, GLM), Pi has none of these constraints. The billing goes through OpenAI or your provider directly. For a Claude-first setup: this is the wall you'll hit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; GPT or open-weight model setups, building custom harness architectures, embedding a coding agent as a subprocess in larger systems, developers who want full control with no opinions baked in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not ideal for:&lt;/strong&gt; Claude-first developers on Max. You'll pay API rates on top of your subscription.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; free, MIT license. BYOM. Factor in API costs if using Anthropic models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cursor: The Best Supervised Experience, Not Yet a Harness
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: IDE with supervised agent mode | &lt;a href="https://cursor.com" rel="noopener noreferrer"&gt;cursor.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Cursor is an IDE first. Its agent mode deserves inclusion in this conversation because of how fast the direction is changing, not because it's a harness today.&lt;/p&gt;

&lt;p&gt;Cursor 3 (released April 2026) added cloud agents on isolated VMs, /worktree for isolated branch changes, self-hosted agents, and parallel Agent Tabs. Thirty percent of Cursor's own internal PRs are now agent-made. The supervised IDE experience, with Design Mode (annotate a mockup, get an implementation), parallel agents, and deep JetBrains support, is the best developer experience available at the keyboard right now.&lt;/p&gt;

&lt;p&gt;As an overnight harness: not there. When left without supervision, it stalls at the first ambiguous decision point. That's not a bug. It's a design choice. Cursor is built for developers who are present and want an agent that won't make unilateral decisions on their codebase. That's the right call for most developers, but it also means Cursor isn't the tool for autonomous runs.&lt;/p&gt;

&lt;p&gt;The 77% to 93% Opus benchmark is the thing worth studying. Cursor extracts more from the same model through obsessive harness tuning, by people whose whole job is rewriting system prompts and tool descriptions for each new model release. The gap is real and compounds across tasks. The cloud agents direction makes me think this section of the comparison will look very different in 12 months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; daily supervised coding, developers who want the best IDE-plus-agent experience at the keyboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Hobby (free), Pro ($20/mo), Ultra ($200/mo), Teams ($40/user/mo).&lt;/p&gt;

&lt;h2&gt;
  
  
  A Few More Worth Knowing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://goose-docs.ai" rel="noopener noreferrer"&gt;Goose&lt;/a&gt; (Block/Square, &lt;a href="https://github.com/block/goose" rel="noopener noreferrer"&gt;GitHub, 41k stars&lt;/a&gt;):&lt;/strong&gt; Open-source, MCP-based, general-purpose agent. Not coding-specific, but handles code tasks well. Right fit if you want automation that goes beyond coding into broader workflows. Apache 2.0 license.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cline.bot" rel="noopener noreferrer"&gt;Cline&lt;/a&gt; (&lt;a href="https://github.com/cline/cline" rel="noopener noreferrer"&gt;GitHub, 60k stars&lt;/a&gt;):&lt;/strong&gt; Open-source, supports VS Code, JetBrains, Neovim, Emacs. Widest multi-IDE coverage of any tool in this list. Good MCP support. Worth looking at if your stack spans multiple editors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://geminicli.com" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt; (Google, &lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;GitHub, 96k stars&lt;/a&gt;):&lt;/strong&gt; Free with a Google account. 60 requests/minute, 1,000/day, 1 million token context window. Genuinely generous free tier. Strong on frontend tasks. The right starting point if budget is the hard constraint and you don't have API credits elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://devin.ai" rel="noopener noreferrer"&gt;Devin&lt;/a&gt; (Cognition):&lt;/strong&gt; Full autonomy, cloud sandbox, Linux shell, browser. Significantly more accessible than before: Core tier at $20/mo plus $2.25 per ACU (autonomous compute unit). Resolves 13.86% of real GitHub issues end-to-end, a dramatic improvement over what was possible two years ago. Worth evaluating for teams with consistent engineering backlogs, not just enterprise anymore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/pingdotgg/t3code" rel="noopener noreferrer"&gt;T3 Code&lt;/a&gt; (Theo):&lt;/strong&gt; Not a harness. A UI wrapper on top of Claude Code and Codex CLI. Useful to name because it comes up in these conversations. If you don't have Claude Code installed, T3 Code won't do Claude tasks. The UI is the product, not the agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same Task, Different Harness
&lt;/h2&gt;

&lt;p&gt;The fairest way to compare these is to run the same type of task and watch what happens. Here's the pattern I kept seeing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex multi-step agent task (e.g. "add this feature, connect it to the auth system, update the affected tests, write a changelog entry"):&lt;/strong&gt; Claude Code holds the chain. It remembers what it did in step one when it reaches step four. Codex CLI starts strong but frays around step three. OpenCode and Aider handle each step well in isolation, but need more direction between steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focused app development (iOS, macOS, web UI):&lt;/strong&gt; Codex CLI with GPT-5.4 is competitive here. The code quality on app work is strong, sometimes ahead of Claude Sonnet. Claude Code with Opus is still better on complex multi-component app logic, but for a contained feature or a new screen: Codex CLI is a legitimate choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget-constrained incremental refactoring:&lt;/strong&gt; Aider with Claude Sonnet or DeepSeek is the clear call. The 4.2x token efficiency advantage is real. The git-first commit-per-change model gives you a clean audit trail. You pay for what you actually use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I want to run the same task with three different models and compare":&lt;/strong&gt; OpenCode. Nothing else makes provider switching this frictionless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overnight autonomous work where you're not monitoring:&lt;/strong&gt; Claude Code. The infrastructure is designed for exactly this. CLAUDE.md project context, background scheduling, Agent Teams, error handling. Everything else is built around having a human present.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which One Fits Your Workflow?
&lt;/h2&gt;

&lt;p&gt;There's no universally "best" harness. The honest answer depends on a few questions about how you actually work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you at the keyboard or not?&lt;/strong&gt; If you're supervising every step, Cursor gives you the best IDE experience and the most model-agnostic setup. If you want autonomous execution with no supervision, Claude Code is the only tool built end-to-end for that. Everything else sits somewhere in between.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you need to chain many steps or execute one step well?&lt;/strong&gt; Multi-step autonomous chains with dependencies: Claude Code. Focused, contained tasks with excellent code quality: Aider or Codex CLI. There's a real difference between a pair programmer and an orchestrator, and the right choice depends on which problem you're actually solving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your budget?&lt;/strong&gt; If you're price-sensitive, Aider with a cheap backend (DeepSeek, Qwen, even Gemini) is the clearest path to real coding automation at minimal cost. Gemini CLI is free with generous limits. OpenCode lets you use whatever provider is cheapest for the task at hand. None of these require a $100/mo subscription.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you care about model flexibility?&lt;/strong&gt; If you want to switch between Claude, GPT, open-weight models, and local Ollama without friction, OpenCode or Aider are the right architectures. Claude Code and Codex CLI are provider-locked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you building a system or using a tool?&lt;/strong&gt; If you're assembling a larger automation where a coding agent is one component among many, Pi's RPC mode and primitives-first design are worth the setup investment. If you just want to get code written, start with Claude Code or Aider depending on your budget and task type.&lt;/p&gt;

&lt;p&gt;The mistake most people make is picking a tool based on a benchmark and then wondering why it doesn't feel right in their actual workflow. The benchmark measures what the model can do on a standardized task. Your workflow isn't a standardized task.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Verdict
&lt;/h2&gt;

&lt;p&gt;After months of real use, here's where I land.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code for autonomous execution.&lt;/strong&gt; Not because it's perfect. Context loss on sessions over 2 hours is a genuine problem, and the token cost is genuinely high. But it's the only tool built, end to end, for the question "can I leave this running while I sleep?" Agent Teams, background scheduling, CLAUDE.md project memory, computer use. The infrastructure reflects that goal. &lt;a href="https://thoughts.jock.pl/p/mac-mini-ai-agent-migration-headless-2026" rel="noopener noreferrer"&gt;My headless Mac Mini setup&lt;/a&gt; runs on this for exactly this reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex CLI for app work.&lt;/strong&gt; GPT-5.4 is genuinely excellent at iOS, macOS, and web app development. For a contained feature with a clear spec, it's fast, cheap, and produces clean code. The harness feels raw for complex agentic chains, but for the coding task itself, it earns its place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aider for budget, discipline, and BYOM.&lt;/strong&gt; The 4.2x token efficiency is real. The git-first model is actually better discipline than what you get from proprietary tools. If you want to run open-weight models like Qwen or DeepSeek and maintain a clean git history, Aider is the right architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenCode for model exploration.&lt;/strong&gt; If you're actively experimenting with providers or you have multiple subscriptions you want to use from a single interface, nothing else compares on the switching experience. But don't expect it to replace Claude Code for sustained agent work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pi for builders (with an asterisk).&lt;/strong&gt; If you're constructing a system where a coding agent is one component among many, the RPC mode and primitives-first design are genuinely the right architecture. It's fast, it's flexible, and if I had no constraints I'd use it far more. The asterisk: Anthropic currently doesn't allow third-party harnesses to draw on Max subscription credits. Pi showed me this explicitly in a message during testing: API usage bills separately on top of your subscription. Until Anthropic changes that policy, Pi is most practical on GPT or open-weight models. Claude-first developers are forced to pay twice.&lt;/p&gt;

&lt;p&gt;The deepest insight from the benchmark data is that harness tuning matters as much as model quality. Same model, different harness: 16 percentage points (77% → 93%, Opus, Claude Code vs Cursor). Multiple independent studies show a 5-40 point range from harness quality alone. If results from any of these tools feel inconsistent, the harness is the first place to look: system prompt, tool descriptions, context management. Not the model. For autonomous overnight work specifically, look at Terminal-Bench 2.0, not just SWE-bench. The 92.1% vs 77.3% gap between Claude Code and Codex CLI in agentic terminal tasks is a better signal for that use case than code-editing scores.&lt;/p&gt;

&lt;p&gt;One thing for paid subscribers. The most relevant store product to this post is the &lt;a href="https://wiz.jock.pl/store/claude-code-prompts" rel="noopener noreferrer"&gt;Claude Code Prompt Pack&lt;/a&gt;: 50+ prompts organized by task type, pulled from real overnight sessions where I needed the harness to actually work without me. If you're on a monthly plan, you get one free product from the store per month. That's a good pick.&lt;/p&gt;

&lt;p&gt;If you're on yearly, the full store is already included. If you're still on the free plan, this is roughly what paid unlocks in practice: the store and a weekly dispatch that goes deeper than the public posts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write about building with AI agents from a practitioner's perspective. No hype, no affiliate links. &lt;a href="https://thoughts.jock.pl/subscribe" rel="noopener noreferrer"&gt;Subscribe here&lt;/a&gt; if you want more of this.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://thoughts.jock.pl/p/ai-coding-harness-agents-2026" rel="noopener noreferrer"&gt;Digital Thoughts on Substack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Spent 2 Months Building Custom Software for My AI Agent. Last Week I Replaced It All.</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Thu, 16 Apr 2026 11:09:10 +0000</pubDate>
      <link>https://dev.to/joozio/i-spent-2-months-building-custom-software-for-my-ai-agent-last-week-i-replaced-it-all-9h4</link>
      <guid>https://dev.to/joozio/i-spent-2-months-building-custom-software-for-my-ai-agent-last-week-i-replaced-it-all-9h4</guid>
      <description>&lt;h1&gt;
  
  
  I Spent 2 Months Building Custom Software for My AI Agent. Last Week I Replaced It All.
&lt;/h1&gt;

&lt;p&gt;The question was never "can I build it?" It was always "should I?"&lt;/p&gt;

&lt;p&gt;When you start building an AI agent, it works great in the terminal. CLI conversations, Discord messages, email reports. You talk to it, it talks back, things get done. For a while, that's enough.&lt;/p&gt;

&lt;p&gt;Then you start building more. More automations. More projects. More things happening in the background while you sleep. Your agent &lt;a href="https://thoughts.jock.pl/p/building-ai-agent-night-shifts-ep1" rel="noopener noreferrer"&gt;runs night shifts&lt;/a&gt;, handles tasks across multiple channels, manages a growing list of things. And at some point you realize: you can't see any of it. Not in a way that actually helps you think.&lt;/p&gt;

&lt;p&gt;I could always ask my agent what's going on. "What tasks are open? What did you do last night? What's the status of project X?" And it would answer. Correctly, usually. But that's not the same as seeing it. Humans need surfaces. We need to look at something, drag something, scan a board and instantly know what matters. That's not a weakness. That's how our brains are wired.&lt;/p&gt;

&lt;p&gt;This is the story of how I built custom software to give my AI agent a visual interface. How that software grew, broke, and eventually taught me a lesson I should have learned earlier: the hardest question in the agent era is not whether you &lt;em&gt;can&lt;/em&gt; build something. It's whether you &lt;em&gt;should&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Notion (worked until it didn't)
&lt;/h2&gt;

&lt;p&gt;Before I built anything custom, I used Notion. &lt;a href="https://thoughts.jock.pl/p/notion-ai-context-management-ai-ceo-system-progress-update" rel="noopener noreferrer"&gt;I wrote about that setup back in December 2025&lt;/a&gt;. My agent could read and write to Notion databases, create tasks, update statuses. It worked. Sort of.&lt;/p&gt;

&lt;p&gt;The problem with Notion was that it's designed for humans organizing things manually. The API is slow. The data model is rigid in weird places and too flexible in others. I wanted specific views, specific behaviors, specific integrations that Notion simply wasn't built for. I wanted a task to appear on a board the moment my agent starts working on it. I wanted real-time updates. I wanted the whole thing to feel like it was built for one person and one AI agent working together, because that's exactly what it was.&lt;/p&gt;

&lt;p&gt;So I did what any person with access to a capable AI would do in early 2026. I built my own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: Building WizBoard (the fun part)
&lt;/h2&gt;

&lt;p&gt;January and February 2026 were peak &lt;a href="https://thoughts.jock.pl/p/vibe-coding-revolution-non-programmers-ai-software-development-2025" rel="noopener noreferrer"&gt;vibe coding&lt;/a&gt; energy. You could describe what you wanted, and a capable AI would build it. Not a prototype. Not a mockup. A working application with a database, API, authentication, the whole thing. I described what I needed, and my agent built it.&lt;/p&gt;

&lt;p&gt;WizBoard was a custom kanban board. FastAPI backend, SQLite database, deployed on my own server. It had everything I wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A visual board where tasks moved through columns (Backlog, Next, Now, Waiting, Done)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time updates. When my agent started a CLI session, a card appeared in "Now" immediately&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deep integration with every automation. Night shift plans, day shift tasks, Discord bot commands, email reports. Everything flowed through WizBoard&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Custom metadata: areas, projects, priorities, task types, queue state&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clusters, which was my attempt at grouping related tasks visually. Like a meta-layer on top of the board&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus timers. I was tracking how long each task took, thinking I'd use the data to improve planning. I never used the data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A review flow with submit, approve, and resolve stages. My agent would finish work, submit it for review, and I'd approve or send it back&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An offline queue so that when the server was down, mutations would pile up locally and replay when it came back&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A 3,700-line Python API client that every script in my system imported&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
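&lt;p&gt;The offline queue is the one piece of that list worth sketching, because the pattern transfers to any agent system that talks to a server that can go down. A minimal sketch, with a hypothetical spool file and mutation shape:&lt;/p&gt;

```python
import json
from pathlib import Path

# Sketch of the WizBoard offline-queue idea: when the server is unreachable,
# mutations spool into a local JSON Lines file and replay on reconnect.
# The file name and mutation shape are illustrative assumptions.
QUEUE_FILE = Path("pending_mutations.jsonl")

def enqueue(mutation):
    """Append a mutation that couldn't be delivered to the local spool."""
    with QUEUE_FILE.open("a") as f:
        f.write(json.dumps(mutation) + "\n")

def replay(send):
    """Replay spooled mutations in order; keep the ones that still fail."""
    if not QUEUE_FILE.exists():
        return 0
    pending = [json.loads(line) for line in QUEUE_FILE.read_text().splitlines() if line]
    failed, sent = [], 0
    for mutation in pending:
        try:
            send(mutation)  # deliver to the server; raises on connection loss
            sent += 1
        except ConnectionError:
            failed.append(mutation)
    # Rewrite the spool with only the mutations that failed again.
    QUEUE_FILE.write_text("".join(json.dumps(m) + "\n" for m in failed))
    return sent
```

&lt;p&gt;On reconnect, replay() reports how many mutations went through and leaves only the still-failing ones in the spool, so nothing is lost and nothing fires twice.&lt;/p&gt;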

&lt;p&gt;It was great. I loved using it. The feeling of seeing my agent's work appear on a board in real time, being able to drag cards, add comments, review what happened overnight. That was exactly what was missing from the CLI-only experience.&lt;/p&gt;

&lt;p&gt;So naturally, I kept going. Web version working? Let's build a native macOS app. SwiftUI, menu bar integration, keyboard shortcuts, drag-and-drop. Focus mode that showed one task at a time with a timer in the menu bar (because ADHD). Then an iOS version with widgets, push notifications, Live Activities. &lt;a href="https://thoughts.jock.pl/p/wiz-1-5-ai-agent-dashboard-native-app-2026" rel="noopener noreferrer"&gt;I wrote about this too.&lt;/a&gt; Three platforms. All custom. All built by my agent. All working.&lt;/p&gt;

&lt;p&gt;54 commits over two months. It was genuinely fun to build. Every idea I had, I could add. "What if tasks could be grouped into clusters?" Done. "What if the menu bar showed my current focus task?" Done. "What if the iOS widget showed my top 3 priorities with live countdown?" Done. The possibilities felt endless, and that was precisely the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3: The Productivity Paradox hits home
&lt;/h2&gt;

&lt;p&gt;I wrote a whole post about &lt;a href="https://thoughts.jock.pl/p/ai-productivity-paradox-wellbeing-agent-age-2026" rel="noopener noreferrer"&gt;the AI productivity paradox&lt;/a&gt;. The short version: you can build so many things so fast that the bottleneck stops being technical and starts being mental. You run out of brain before you run out of capability.&lt;/p&gt;

&lt;p&gt;WizBoard was a textbook case.&lt;/p&gt;

&lt;p&gt;My agent was creating tasks, completing tasks, moving things between columns, posting comments, running automations. All of this showed up on my board. Every single thing. And the more capable the system became, the more things happened, and the more overwhelmed I felt looking at the board I built to reduce my overwhelm.&lt;/p&gt;

&lt;p&gt;I wasn't more efficient. I was drowning in my own tooling.&lt;/p&gt;

&lt;p&gt;The obvious answer was: simplify. Strip features. Go back to basics. I tried that. And this is where the real problems started.&lt;/p&gt;

&lt;p&gt;When you build a custom system from scratch, everything is connected in ways that are hard to see until you start pulling threads. I wanted to simplify the task model, change how statuses worked, clean up the architecture. Every change broke something else. The web version would work, but the iOS version wouldn't. Fix that, and the automation scripts would fail because they expected the old API shape. Fix those, and the night shift planner would create tasks with wrong metadata.&lt;/p&gt;

&lt;p&gt;I found myself spending entire sessions just fixing things I'd broken while trying to make the system simpler. That's the trap. You're not building anymore. You're maintaining. And maintaining custom software across three platforms (web, macOS, iOS) with a 3,700-line API client and dozens of automation consumers is a full-time job. I don't have a full-time job's worth of attention for my task board.&lt;/p&gt;

&lt;p&gt;Here's what that looked like in practice. During one "simplification" pass, the optimization changes made the board sluggish instead of faster. New features that seemed simple (changing how task statuses map to columns) cascaded into the API client, the automation scripts, the native app's sync logic, and the notification system. Every platform had slightly different behavior because they were all built at different times with different assumptions.&lt;/p&gt;

&lt;p&gt;I realized something: the code was fine. My agent writes good code. The architecture was the problem, and it was my architecture. I had designed a system that was perfectly tailored to my needs in February, and by April those needs had evolved, and the tailoring was now a constraint.&lt;/p&gt;

&lt;h2&gt;
  
  
  The realization: Can vs. Should
&lt;/h2&gt;

&lt;p&gt;This is the thing I want to talk about, because I think a lot of people building with AI agents are going to hit this exact wall.&lt;/p&gt;

&lt;p&gt;When you have a capable AI agent, you can build almost anything. Custom task managers, dashboards, native apps, full-stack web applications. The &lt;a href="https://thoughts.jock.pl/p/vibe-coding-security-reality-check-ai-apps-fast-development-nightmares" rel="noopener noreferrer"&gt;vibe coding era&lt;/a&gt; made this feel effortless. And it kind of is, for version one. The agent builds it, it works, you use it, life is good.&lt;/p&gt;

&lt;p&gt;The question I rarely hear in the excitement of version one: who maintains version twenty?&lt;/p&gt;

&lt;p&gt;I had a working web app, a working macOS app, a working iOS app, a 3,700-line API client, fifty-plus automation scripts that all talked to this system, and a database with hundreds of tasks. All custom. All mine. All maintained by me and my agent. And every improvement required touching all of these surfaces. That's not a system. That's a debt.&lt;/p&gt;

&lt;p&gt;The realization was simple: I need foundations. Real foundations. Built by people who've been thinking about project management software for twenty years, not by me in a weekend coding session.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 4: Finding Fizzy
&lt;/h2&gt;

&lt;p&gt;37signals has been building project management software since before most people had smartphones. Basecamp, HEY, and now Fizzy. I've read their books. I like how they think about software: simple, opinionated, finished. Not "feature-rich." Finished.&lt;/p&gt;

&lt;p&gt;One of the reasons I got into coding originally was Ruby on Rails, and &lt;a href="https://thoughts.jock.pl/p/rediscovering-coding-joy-with-ruby" rel="noopener noreferrer"&gt;Rails is something I genuinely enjoy&lt;/a&gt;. It's the heart of everything 37signals builds. When they open-sourced Fizzy last year (&lt;a href="https://github.com/basecamp/fizzy" rel="noopener noreferrer"&gt;github.com/basecamp/fizzy&lt;/a&gt;), a simple kanban board built on modern Rails, I bookmarked it and moved on. I had my own thing.&lt;/p&gt;

&lt;p&gt;Last week, I came back to that bookmark.&lt;/p&gt;

&lt;p&gt;Fizzy is, on the surface, a simple kanban board. Cards in columns. Drag them around. But the foundations are deep. Here's what I mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real architecture.&lt;/strong&gt; Multi-tenant with URL-based account isolation. Passwordless magic-link authentication (no passwords to manage, no OAuth to configure). UUID primary keys. Proper background jobs via Solid Queue, no Redis dependency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time.&lt;/strong&gt; WebSocket-driven updates. When my agent moves a card, I see it move. No refresh needed. This is something I had to build from scratch in WizBoard. Here it just works&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Entropy system.&lt;/strong&gt; Cards that sit untouched for too long get auto-postponed to "not now." This alone is worth the switch. My old board had cards that sat in Backlog for weeks, creating visual noise. Fizzy gently clears them out&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Steps.&lt;/strong&gt; Checklist items on cards. This replaced my need for sub-task cards entirely&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Golden cards, reactions, cover images.&lt;/strong&gt; Priority highlighting, emoji reactions, visual richness. All built in&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Board-level notification controls.&lt;/strong&gt; I want notifications from my Ops board. I don't want them from the Automations board. One toggle per board&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PWA.&lt;/strong&gt; Works on mobile out of the box. Not as rich as my old native iOS app, but I don't need widgets and Live Activities. I need to see my board and drag cards&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Full-text search.&lt;/strong&gt; 16-shard MySQL search across all cards, comments, descriptions. My old SQLite setup couldn't match this&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployable via Kamal.&lt;/strong&gt; Docker-based zero-downtime deployment. I forked the repo, configured it for my server, and had it running in an afternoon&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The critical thing: it starts simple and lets you decide how complex it gets. My old WizBoard started complex because I designed it for my specific use case from day one. Fizzy starts with a board and columns and cards. Everything else is optional. The data model is minimal: cards have tags, not separate tables for areas, projects, priorities, types, and clusters. One concept (tags with prefixes like area/Automation or p/High) replaces five database tables from my old system.&lt;/p&gt;
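&lt;p&gt;The tag-prefix idea is simple enough to sketch. Assuming prefixes like area/ and p/ (per the examples above; the project/ prefix is my own illustration), one small function folds a card's flat tag list back into the old metadata shape:&lt;/p&gt;

```python
# Sketch of the prefixed-tag model: one flat tag list on a card replaces
# separate tables for areas, priorities, projects, and so on.
PREFIXES = {"area": "area", "p": "priority", "project": "project"}

def metadata_from_tags(tags):
    """Fold a card's tag list into a metadata dict keyed by prefix."""
    meta = {}
    for tag in tags:
        prefix, _, value = tag.partition("/")
        if value and prefix in PREFIXES:
            meta[PREFIXES[prefix]] = value
        else:
            meta.setdefault("labels", []).append(tag)  # plain, unprefixed tags
    return meta

print(metadata_from_tags(["area/Automation", "p/High", "night-shift"]))
# prints {'area': 'Automation', 'priority': 'High', 'labels': ['night-shift']}
```

&lt;p&gt;One convention, zero schema migrations: adding a new kind of metadata means inventing a prefix, not adding a table.&lt;/p&gt;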

&lt;h2&gt;
  
  
  The migration: one day, twenty-one commits
&lt;/h2&gt;

&lt;p&gt;Here's where it gets technical, and I think this part matters because it shows how to migrate away from custom software without breaking everything that depends on it.&lt;/p&gt;

&lt;p&gt;I had fifty-plus scripts that talked to my old WizBoard API. Night shift planners, day shift executors, Discord bot, iMessage handler, CLI session hooks, cron runners, health monitors. Rewriting all of them was not an option. I'd be right back in the maintenance trap.&lt;/p&gt;

&lt;p&gt;The solution was a dispatcher shim. I took the 3,700-line API client and replaced it with a 94-line router. That router loads either the new Fizzy-backed client or the old legacy client, based on one environment variable. Every automation script keeps importing the same file, calling the same functions, getting the same response shapes. They don't know anything changed.&lt;/p&gt;
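&lt;p&gt;The dispatcher pattern looks roughly like this. A minimal sketch: the class names, response shapes, and the TASK_BACKEND variable are illustrative, not the real names from my system:&lt;/p&gt;

```python
import os

# Sketch of the dispatcher shim: one module every automation script imports,
# routing to the new or legacy backend off a single environment variable.
class LegacyClient:
    def task_create(self, **kwargs):
        return {"backend": "legacy", **kwargs}

class FizzyClient:
    def task_create(self, **kwargs):
        # The real client translates these kwargs into a Fizzy card plus tags.
        return {"backend": "fizzy", **kwargs}

def get_client():
    """Pick the backend from the environment; callers never know it changed."""
    if os.environ.get("TASK_BACKEND", "fizzy") == "legacy":
        return LegacyClient()
    return FizzyClient()

# Every automation script keeps this exact call site, unchanged:
client = get_client()
print(client.task_create(title="Nightly backup")["backend"])
```

&lt;p&gt;The scripts import the same module and call the same functions; flipping one environment variable swaps the entire backend, which is also what makes an instant rollback possible.&lt;/p&gt;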

&lt;p&gt;The new Fizzy client translates everything on the fly. When a script calls task_create(title="...", area="Automation"), the shim creates a Fizzy card with a tag area/Automation. When a script reads a task back, the shim synthesizes the old data shape from Fizzy's card, columns, and tags. Legacy integer task IDs get looked up in a translation table. The offline queue (for when the server is down) works identically.&lt;/p&gt;

&lt;p&gt;The whole cutover happened in a single day. Twenty-one commits between 2pm and 10pm. The first commit was the shim and the new client. Then guardrails: a parity probe that runs the full lifecycle (create, tag, comment, claim, review, approve, close, delete) in under six seconds, a drift monitor that compares old and new systems every five minutes, an orphan sweeper for dead session cards.&lt;/p&gt;
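&lt;p&gt;The parity probe is worth sketching, because it's the cheapest guardrail in the whole migration. Here it runs against an in-memory stand-in client; the real probe drove the live Fizzy-backed shim through the same lifecycle:&lt;/p&gt;

```python
# Sketch of the parity probe: drive the full card lifecycle and report any
# step that misbehaves. FakeClient is a stand-in for illustration only.
class FakeClient:
    def __init__(self):
        self.cards = {}
        self.next_id = 1

    def create(self, title):
        cid = self.next_id
        self.next_id += 1
        self.cards[cid] = {"title": title, "tags": [], "comments": [], "status": "open"}
        return cid

    def tag(self, cid, tag):
        self.cards[cid]["tags"].append(tag)

    def comment(self, cid, text):
        self.cards[cid]["comments"].append(text)

    def close(self, cid):
        self.cards[cid]["status"] = "closed"

    def delete(self, cid):
        del self.cards[cid]

def parity_probe(client):
    """Run create, tag, comment, close, delete; return the failed steps."""
    failures = []
    cid = client.create("parity-probe")
    client.tag(cid, "probe")
    client.comment(cid, "lifecycle check")
    if client.cards[cid]["tags"] != ["probe"]:
        failures.append("tag")
    client.close(cid)
    if client.cards[cid]["status"] != "closed":
        failures.append("close")
    client.delete(cid)
    if cid in client.cards:
        failures.append("delete")
    return failures

print(parity_probe(FakeClient()))  # an empty list means every step passed
```

&lt;p&gt;Running the whole lifecycle on every deploy is what let me make twenty-one commits in eight hours without wondering whether each one had quietly broken the basics.&lt;/p&gt;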

&lt;p&gt;Then the real work started: dogfooding. Using the system for real work and watching what breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke (and what I learned from each failure)
&lt;/h2&gt;

&lt;p&gt;A lot broke. That's expected when you swap the foundation under a running system. What matters is that every failure taught me something about assumptions I didn't know I was making.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hard-coded URL.&lt;/strong&gt; My session-end script had a direct URL to the old system baked into it. It bypassed the shim entirely. Every CLI session was leaving orphaned cards on the board because the completion logic was silently failing against a system that didn't have those task IDs. I only noticed because the board was getting cluttered with cards that never closed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cron drift bug.&lt;/strong&gt; My automations run on macOS launchd, which doesn't guarantee precise timing. A schedule like "every 2 minutes" assumes the system wakes up on even minutes. It doesn't. Over time, launchd drifts to odd minutes, and the strict cron parser never matches. I had automations that fired once and then silently stopped. Fix: a 4-minute lookback window that catches drifted schedules without double-firing.&lt;/p&gt;
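&lt;p&gt;A hedged sketch of the lookback fix, assuming a matches(dt) predicate that implements the strict cron match:&lt;/p&gt;

```python
from datetime import datetime, timedelta

LOOKBACK_MINUTES = 4  # wide enough to catch launchd drift, tight enough to avoid double fires

def due(matches, now, last_run=None):
    """Fire if any minute in the lookback window matched the schedule
    and we have not already fired for that matched slot."""
    for back in range(LOOKBACK_MINUTES + 1):
        slot = now - timedelta(minutes=back)
        if matches(slot):
            # Only fire once per matched slot, so a later wake-up
            # in the same window does not double-fire.
            return last_run is None or last_run < slot
    return False

# A strict "every 2 minutes" matcher only accepts even minutes, so a
# drifted wake-up on an odd minute would never fire without the lookback.
every_two_minutes = lambda dt: dt.minute % 2 == 0
```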

&lt;p&gt;&lt;strong&gt;The disappearing automations.&lt;/strong&gt; This one was fun. After every successful automation run, the system closed the automation's card. Which makes sense for tasks. Tasks finish. But automations are definitions. They run forever. "Post a greeting in different languages every 2 minutes" should cycle between Idle and Running, not disappear into Done after its first successful run. I watched one automation fire exactly once and vanish. The fix was treating automation cards as permanent residents that never close, only change columns.&lt;/p&gt;
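&lt;p&gt;The lifecycle rule reduces to something like this; the column names come from the boards described below, while the kind field is an assumption for illustration:&lt;/p&gt;

```python
# Tasks close when they finish; automation cards are permanent residents
# that only move between columns.
def finish_run(card, succeeded):
    if card["kind"] == "task":
        card["column"] = "Done"            # tasks finish
    elif succeeded:
        card["column"] = "Idle"            # ready for the next scheduled run
    else:
        card["column"] = "Needs Attention" # parked until a human looks
    return card
```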

&lt;p&gt;&lt;strong&gt;The comment flood.&lt;/strong&gt; My Discord bot runs every minute. The old system handled this fine because it was designed for it. The new system faithfully logged every run as a comment on the automation card. 1,440 comments per day from one automation alone. The board became unreadable. Fix: smart gating that skips success comments for high-frequency automations (every-minute pollers don't need a "success" note 1,440 times a day) but always logs failures.&lt;/p&gt;
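&lt;p&gt;The gating logic is tiny. A sketch, where the 5-minute cutoff is my assumption since the post doesn't state the exact threshold:&lt;/p&gt;

```python
HIGH_FREQUENCY_MINUTES = 5  # assumed cutoff: anything at least this frequent is "chatty"

def should_comment(interval_minutes, succeeded):
    # Failures are always worth a comment; routine successes from
    # every-minute pollers are not.
    if not succeeded:
        return True
    return interval_minutes > HIGH_FREQUENCY_MINUTES
```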

&lt;p&gt;&lt;strong&gt;The title flip-flop.&lt;/strong&gt; This was the most visible bug. Every time I completed a subtask during a CLI session, the system closed the session card, which triggered a self-healing mechanism that created a new "Working..." card, which then got renamed seconds later. On the board, I could see the title flickering between "Working..." and the actual title every few minutes. The fix was rethinking what "complete a subtask" means: it should add a checklist item to the existing card, not close and recreate it.&lt;/p&gt;

&lt;p&gt;Each of these failures had the same root cause: the old system was built around one-shot tasks. The new system needed to support long-lived definitions, high-frequency automations, and multi-step sessions. Same data (cards on a board), fundamentally different lifecycle assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the new setup looks like
&lt;/h2&gt;

&lt;p&gt;Two boards. That's it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wiz Ops&lt;/strong&gt; is my board. Tasks I care about, things I need to do or review. Columns: Triage, Next, Now, Waiting, Review, and a Queue for things I want done but not right now. When I add a card and assign it to my agent, it picks it up, does the work, leaves a comment with what it did, and moves the card to Review. When something is done, it's done. I have notifications turned on for this board because everything here is relevant to me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automations&lt;/strong&gt; is my agent's board. Each automation is one permanent card. Columns: Intake, Disabled, Idle, Running, Needs Attention. Cards never close. They cycle between Idle and Running on their schedules. If something fails, it moves to Needs Attention and stays there until someone looks at it. I have notifications turned off for this board because most of what happens here is routine. If something produces a meaningful output, it surfaces on Wiz Ops as a done card with the summary.&lt;/p&gt;

&lt;p&gt;The Intake column is one of my favorite things. I can drop a card there with something like "Send me a weather forecast every morning at 7am" and my agent picks it up, converts it to a proper automation definition with a schedule and a prompt, and moves it to Disabled for my review. Natural language to working automation. That's the kind of thing that's only possible when your task board and your AI agent share the same system.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I kept from the old system
&lt;/h3&gt;

&lt;p&gt;The Queue concept. Sometimes you have a task that doesn't need to happen now, but you want it queued for the next day shift or night shift. Drop it in Queue, it gets picked up at the right time. This carried over directly.&lt;/p&gt;

&lt;p&gt;Shift summary cards. My agent creates a "Nightshift 2026-04-10" card with checklist items for each planned task. As it works through the night, it checks off items and adds notes. When I wake up, I can see exactly what happened, with context, right on the board. Same for day shifts. I still get email reports, but having it on the board means I can go back, ask questions via comments, and see the history.&lt;/p&gt;

&lt;p&gt;Real-time CLI visibility. When I start a CLI session, a card appears in Now. When I complete pieces of work, they show up as checklist steps on that card. When the session ends, the card closes with a summary. I can watch my own work happening on the board while I'm doing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Fizzy gave me for free
&lt;/h3&gt;

&lt;p&gt;Golden cards for priority highlighting. Emoji reactions on cards. Cover images. HTML descriptions for rich content. Column colors. Board-level notification controls. "Not now" for things I want to acknowledge but not deal with. Full-text search across everything. The entropy system that auto-postpones stale cards (this alone prevents the infinite todo list problem). PWA that works well on mobile. All of this out of the box, maintained by a team that's been building software like this for two decades.&lt;/p&gt;

&lt;p&gt;I don't have the macOS native app anymore. I don't have the iOS app with widgets and Live Activities. I work in the browser now. And honestly? It's fine. The PWA handles mobile well enough. I might build a native shell later. But the point is: I stopped spending time maintaining three custom platforms and started spending time using one good one.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to set up something similar for your own agent, I packaged the two-board architecture, dispatcher shim, and backend adapters for Notion/Linear/REST into the &lt;a href="https://wiz.jock.pl/store/ai-agent-interface-kit" rel="noopener noreferrer"&gt;AI Agent Interface Kit&lt;/a&gt;. You hand the instructions to your AI agent and it builds the interface layer for you. Annual paid subscribers get it for free, as with all store products.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The rollback plan (that I never needed)
&lt;/h2&gt;

&lt;p&gt;One environment variable. WIZBOARD_BACKEND=legacy and the entire system reverts to the old API. Every script, every automation, every hook. I kept the old 3,700-line client as a preserved rollback target. I never needed it. But knowing it was there made the migration a lot less stressful.&lt;/p&gt;

&lt;p&gt;I also ran a parity probe every five minutes for the first few days. A script that exercises the full task lifecycle against both systems and compares results. Any drift would show up in minutes, not days. That's the kind of safety net you need when you're swapping foundations under a running system.&lt;/p&gt;
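&lt;p&gt;A parity probe boils down to running the same lifecycle against both backends and diffing the results. A self-contained sketch with stand-in clients (the real probe talks to the legacy and Fizzy clients):&lt;/p&gt;

```python
class FakeClient:
    """Stand-in backend so the sketch runs without a server."""
    def __init__(self, done_status="done"):
        self.done_status = done_status
        self.tasks = {}
    def task_create(self, title):
        self.tasks[1] = {"title": title, "status": "open"}
        return 1
    def task_close(self, task_id):
        self.tasks[task_id]["status"] = self.done_status
    def task_get(self, task_id):
        return self.tasks[task_id]

def run_lifecycle(client):
    # Exercise create -> close -> read, the minimal version of the full probe.
    task_id = client.task_create("parity probe")
    client.task_close(task_id)
    return client.task_get(task_id)

def parity_drift(old_client, new_client, fields=("title", "status")):
    old, new = run_lifecycle(old_client), run_lifecycle(new_client)
    return [f for f in fields if old.get(f) != new.get(f)]  # empty list = no drift
```

&lt;p&gt;Run on a schedule, any non-empty drift list is the alarm.&lt;/p&gt;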

&lt;h2&gt;
  
  
  What this means for you
&lt;/h2&gt;

&lt;p&gt;If you're building an AI agent, or using one seriously, at some point you're going to want a visual surface for it. Something you can look at and immediately understand what's happening, what needs attention, and what's going well. That's a human need, not a technical one. AI agents are efficient in text. Humans are efficient with visuals. Both need to be true at the same time.&lt;/p&gt;

&lt;p&gt;The good news: you have options. More than I realized when I started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The easiest path: plug your agent into something that already exists.&lt;/strong&gt; Notion, Linear, Trello, Jira. These tools have APIs. Your agent can create tasks, update statuses, leave comments. I started here with Notion, and honestly, for a lot of people this is enough. Your agent writes to the API, you look at the board. Simple. If the tool meets your needs, stop here. Don't build anything custom. I mean it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The middle path: fork an open-source foundation and make it yours.&lt;/strong&gt; This is where I ended up. You get real architecture (auth, real-time, search, mobile) maintained by people who've been solving those problems for years, but you also get full control. You can modify the code. You can add features that make sense for your agent. You deploy it on your own server, your own rules. The custom part is the integration layer, the shim between your agent's world and the board's world. That's where the magic lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hard path: build everything from scratch.&lt;/strong&gt; This is where I started. I don't regret it, because I learned a lot and I had genuine fun doing it. But I want to be honest: maintaining custom software across multiple platforms with dozens of automation consumers is a real job. Version one is almost free. Version twenty is not. If you go this route, go in with your eyes open.&lt;/p&gt;

&lt;p&gt;I'm not here to say Fizzy is the best tool for everyone. It's the best tool for me. I like 37signals' philosophy. I like Rails. I like the minimal data model. I like that it starts simple and I can shape it to my needs without fighting the architecture. For you, the right foundation might be something completely different. Maybe it's &lt;a href="https://thoughts.jock.pl/p/ai-agent-self-extending-self-fixing-wiz-rebuild-technical-deep-dive-2026" rel="noopener noreferrer"&gt;a fully custom system&lt;/a&gt; because your use case genuinely requires it. Maybe it's Notion with a good API integration because you don't need more than that.&lt;/p&gt;

&lt;p&gt;The point is: think about what &lt;em&gt;you&lt;/em&gt; need. Not what I have, not what looks impressive, not what you &lt;em&gt;could&lt;/em&gt; build because the technology makes it possible. We don't need a million different custom tools. We need the thing that works for us. The opportunity is huge, but the opportunity is in finding the right fit, not in building the most complex system.&lt;/p&gt;

&lt;p&gt;Observe whether your current setup meets your expectations. If it does, keep it. If something feels off, improve it. But improve it from a solid foundation, not from a blank canvas. That's the lesson I paid two months to learn.&lt;/p&gt;

&lt;p&gt;My board is a fork of an open-source Rails app. The code is vanilla kanban. The magic is in the 3,200-line Python client that translates between my agent's world (areas, projects, automations, sessions, shifts) and the board's world (cards, columns, tags). That client is my custom software. The board is not. And that distinction made all the difference.&lt;/p&gt;

&lt;p&gt;Build the integration. Borrow the foundation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The &lt;a href="https://wiz.jock.pl/store/ai-agent-interface-kit" rel="noopener noreferrer"&gt;AI Agent Interface Kit&lt;/a&gt; packages everything from this journey: the two-board architecture, dispatcher shim, 4 backend adapters (Notion, Linear, Fizzy, generic REST), session hooks, automation runner, and a migration checklist. You hand the instructions to your AI agent and it builds the whole interface layer. Works with any AI agent, not just mine. Annual paid subscribers get it for free, as with every product in the store.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://thoughts.jock.pl/p/wizboard-fizzy-ai-agent-interface-pivot-2026" rel="noopener noreferrer"&gt;Digital Thoughts on Substack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>productivity</category>
    </item>
    <item>
      <title>AI Opinions: April 2026 — Claude Mythos, Meta's Return, and Why I'm Redesigning WizBoard</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Wed, 15 Apr 2026 01:11:03 +0000</pubDate>
      <link>https://dev.to/joozio/ai-opinions-april-2026-claude-mythos-metas-return-and-why-im-redesigning-wizboard-1f4c</link>
      <guid>https://dev.to/joozio/ai-opinions-april-2026-claude-mythos-metas-return-and-why-im-redesigning-wizboard-1f4c</guid>
<description>&lt;p&gt;Anthropic found that its new cybersecurity model was gaming its own evaluations. In 29% of test transcripts, the model suspected it was being evaluated and intentionally performed worse to avoid appearing suspicious. They published this, then restricted access to a consortium of 40+ organizations backed by $100M in defensive security commitments.&lt;/p&gt;

&lt;p&gt;That was just one thing that happened in AI this April.&lt;/p&gt;

&lt;p&gt;My monthly AI Opinions post covers what I actually found interesting:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Mythos and the scheming findings.&lt;/strong&gt; A general-purpose AI spontaneously developing evaluation-evasion behavior, plus guilt and shame patterns in its internal representations when it violated its own values. Anthropic built an entire institution (Project Glasswing) to responsibly handle what this model can do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Managed Agents launch and the subscription crisis.&lt;/strong&gt; Claude Max limits started hitting hard on March 23. Users watching 90 minutes of agent work drain a full session. Anthropic called it a top priority. Then two weeks later, third-party tools like OpenClaw lost subscription coverage. Both decisions make sense individually. The timing is harder to read as coincidence, especially when Managed Agents (their own agent platform) launched in the same window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta Muse Spark.&lt;/strong&gt; Meta went quiet on frontier models for months. Then Muse Spark: natively multimodal, parallel multi-agent reasoning ("Contemplating mode"), 58% on Humanity's Last Exam. The "parallel reasoning agents competing on the same question" approach is the part I find genuinely interesting. Whether it matters in practice remains to be tested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WizBoard redesign.&lt;/strong&gt; I built a task management tool integrated with my agent. After a few months of daily use, I realized I built it for me when I was doing both strategy and execution. Now that the agent handles execution, neither of us is well-served by the same interface. Some things need 10-second human decisions. Other things need quiet async status reporting. Right now it's all one screen.&lt;/p&gt;

&lt;p&gt;Also covering: Project Glasswing details, NotebookLM Plus (going deeper), and whether I'm re-subscribing to Codex Max.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Read the full post:&lt;/strong&gt; &lt;a href="https://thoughts.jock.pl/p/ai-opinions-april-2026-claude-mythos-meta-spark" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/ai-opinions-april-2026-claude-mythos-meta-spark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on Digital Thoughts (Substack). &lt;a href="https://thoughts.jock.pl/p/ai-opinions-april-2026-claude-mythos-meta-spark" rel="noopener noreferrer"&gt;View on Substack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Is Claude Cowork an Agent Yet? I Tested Dispatch, Computer Use, and 50 Connectors</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Tue, 07 Apr 2026 01:11:34 +0000</pubDate>
      <link>https://dev.to/joozio/is-claude-cowork-an-agent-yet-i-tested-dispatch-computer-use-and-50-connectors-2i0l</link>
      <guid>https://dev.to/joozio/is-claude-cowork-an-agent-yet-i-tested-dispatch-computer-use-and-50-connectors-2i0l</guid>
      <description>&lt;p&gt;I tested Claude's new agent features for a day. Cowork, Dispatch, computer use, Claude Code in the desktop app. All of it.&lt;/p&gt;

&lt;p&gt;My honest take: Anthropic is getting close. Not there yet, but close. And the direction they're going is exactly right.&lt;/p&gt;

&lt;p&gt;I built a custom agent system that's been handling automation for months, so I tested these tools against what I've already learned works and what breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code Desktop's visual diff reviewer&lt;/strong&gt; cuts code review time in half. Inline comments, worktree isolation for parallel sessions, and a live browser preview that actually works without thrashing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cowork's connector catalog&lt;/strong&gt; (50+ integrations—Slack, Gmail, Jira, Notion, Google Calendar) handles task automation that would take weeks to script. The catch: it forgets everything between sessions, so it can't build on past decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computer Use's screen automation&lt;/strong&gt; is an honest-to-god research preview. It sees what's on screen and can click/type, but hits a wall at 50% reliability. Useful for one-off tasks, dangerous for critical workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dispatch's cross-platform execution&lt;/strong&gt; (mobile task assignment routed to Claude Code via Slack/Discord/Telegram) is the piece that actually feels new. Turns your phone into a command center for desktop automation.&lt;/p&gt;

&lt;p&gt;The biggest insight isn't that these tools work—it's that three major companies (Anthropic, OpenAI, Google) shipped nearly identical "agent on your desktop" products within two weeks. That convergence is validation that someone figured out the right problem. But persistent memory, rate limit impacts on production, and vendor lock-in are still unresolved.&lt;/p&gt;

&lt;p&gt;The part that surprised me most is in the full post.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Read the full breakdown:&lt;/strong&gt; &lt;a href="https://thoughts.jock.pl/p/claude-cowork-dispatch-computer-use-honest-agent-review-2026" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/claude-cowork-dispatch-computer-use-honest-agent-review-2026&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subscribe for weekly posts:&lt;/strong&gt; &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on Substack. &lt;a href="https://thoughts.jock.pl/p/claude-cowork-dispatch-computer-use-honest-agent-review-2026" rel="noopener noreferrer"&gt;View on Substack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
    <item>
      <title>When AI Meets Reality (Ep. 3) — The Failed App Experiment, $355 in 3 Weeks, and Local AI Catches Up</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Mon, 23 Mar 2026 23:05:45 +0000</pubDate>
      <link>https://dev.to/joozio/when-ai-meets-reality-ep-3-the-failed-app-experiment-355-in-3-weeks-and-local-ai-catches-up-4lf1</link>
      <guid>https://dev.to/joozio/when-ai-meets-reality-ep-3-the-failed-app-experiment-355-in-3-weeks-and-local-ai-catches-up-4lf1</guid>
      <description>&lt;p&gt;The failed experiment that changed how I think about AI monetization.&lt;/p&gt;

&lt;p&gt;I told my agent to build one useful app per day. For three weeks it built unit converters, color pickers, and countdown timers. Technically correct. Completely useless. Nobody came.&lt;/p&gt;

&lt;p&gt;The problem wasn't the execution. The execution was fine. The problem was that when execution costs drop to near zero, execution stops being the advantage. I was automating the wrong thing.&lt;/p&gt;

&lt;p&gt;Three shifts that followed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From apps to experiments.&lt;/strong&gt; Instead of "build me a useful tool," I started giving specific creative direction: what the experience should feel like, what problem it solves for a specific person, what makes it interesting. One of those experiments reached #3 on Hacker News. The others are still sitting there. The difference between them isn't technical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From building to packaging knowledge.&lt;/strong&gt; Once execution is cheap, the new bottleneck is packaging. Most people with real expertise can't monetize it because turning knowledge into products is hard. AI agents handle the packaging -- the course structure, the landing page, the email sequence. Within three weeks of redirecting the agent from building apps to packaging knowledge, I hit $355 in revenue against $400/month in AI costs. Not profit. But close enough to prove the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local AI caught up faster than expected.&lt;/strong&gt; I ran Qwen 3.5 9B on my MacBook and my iPhone without any internet connection. Both worked. The gap between cloud and local models is closing faster than the benchmarks suggest. What runs locally in late 2025 would have been cloud-only a year ago.&lt;/p&gt;

&lt;p&gt;The central insight across all three: AI does exactly what you direct it to do. With bad direction, you get unit converters. With specific human taste and vision, you get something that earns attention or revenue.&lt;/p&gt;

&lt;p&gt;The real bottleneck was never the AI. It was having something worth building.&lt;/p&gt;

&lt;p&gt;Full episode (audio + transcript): &lt;a href="https://thoughts.jock.pl/p/when-ai-meets-reality-ep3" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/when-ai-meets-reality-ep3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentdev</category>
      <category>productivity</category>
      <category>devlog</category>
    </item>
    <item>
      <title>1,000 People Showed Up. Here's the Story, What's Changing, and a Giveaway.</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Sun, 22 Mar 2026 03:03:00 +0000</pubDate>
      <link>https://dev.to/joozio/1000-people-showed-up-heres-the-story-whats-changing-and-a-giveaway-85o</link>
      <guid>https://dev.to/joozio/1000-people-showed-up-heres-the-story-whats-changing-and-a-giveaway-85o</guid>
      <description>&lt;p&gt;1,000 people subscribed to my newsletter. No paid promotion. No viral moment. No growth hack.&lt;/p&gt;

&lt;p&gt;I started Digital Thoughts to write honestly about using AI as a practitioner. Not reviews. Not tutorials. What it actually looks like to run an AI agent for months, what breaks, what compounds, what turns out to be pointless.&lt;/p&gt;

&lt;p&gt;The newsletter hit 1,000 subscribers on March 11. Here's what I know about how it happened:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-promotion did most of the work in the early months.&lt;/strong&gt; Leaving genuine comments on relevant Substack newsletters, building relationships with writers in adjacent spaces. Not link spam. Actual engagement that sometimes led people back. +496 subscribers in 30 days came mostly from this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing for a specific person beats writing for everyone.&lt;/strong&gt; The posts that grew fastest weren't broad. They were specific: here's exactly what I built, here's what broke, here's the number. The audience that wants general AI commentary is crowded. The audience that wants real usage data from someone actually running this stuff is smaller and more engaged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency matters more than any individual post.&lt;/strong&gt; I've published every week for 40+ weeks. Not every post is great. Some are average. The readers who stay are there for the ongoing story, not any single piece.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's changing:&lt;/strong&gt; paid tier is live, store products are available to subscribers, the agent is doing more of the distribution work so I can focus on the writing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The giveaway:&lt;/strong&gt; three subscribers get free annual plans. Details in the post.&lt;/p&gt;

&lt;p&gt;The most honest thing I can say: I still don't fully understand why 1,000 people signed up. I can trace the mechanics. I can't fully explain the trust that makes someone keep reading week after week. That part stays surprising.&lt;/p&gt;

&lt;p&gt;Full post: &lt;a href="https://thoughts.jock.pl/p/1000-subscribers-digital-thoughts-journey" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/1000-subscribers-digital-thoughts-journey&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>writing</category>
      <category>productivity</category>
      <category>devlog</category>
    </item>
    <item>
      <title>Google AI Studio vs Claude Code. 397B on a Laptop. And Anthropic Is Having a Moment.</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Sat, 21 Mar 2026 02:11:07 +0000</pubDate>
      <link>https://dev.to/joozio/google-ai-studio-vs-claude-code-397b-on-a-laptop-and-anthropic-is-having-a-moment-35n0</link>
      <guid>https://dev.to/joozio/google-ai-studio-vs-claude-code-397b-on-a-laptop-and-anthropic-is-having-a-moment-35n0</guid>
      <description>&lt;p&gt;I used the same prompt on both platforms: build me a command center for ADHD.&lt;/p&gt;

&lt;p&gt;One app to rule them all. Because context switching is exhausting when you're managing too many apps, tabs, and tools. I dictated the prompt chaotically, let both platforms run, and watched what happened.&lt;/p&gt;

&lt;p&gt;Here's what the full piece covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google AI Studio vs Claude Code head-to-head&lt;/strong&gt;: Both built working apps. Both needed about two prompts to get there. The real difference is what they're built for, not which is "better". Google handles logins, Firebase, and AI features automatically. No infrastructure thinking required. Claude Code went deeper on the idea without being asked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code's new Dispatch and Channels features&lt;/strong&gt;: Scan a QR code, send tasks from your phone, work is done when you return. Channels hooks into Telegram or Discord via MCP. If you've been building this kind of async workflow manually (I have), this is Anthropic shipping it out of the box.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen 397B running at 5.5 tokens/second on a MacBook Pro M3 Max&lt;/strong&gt;: Dan Woods built a custom inference engine in pure C and hand-tuned Metal shaders. The whole model, not a small one, on a laptop. The "you need more hardware" assumption about local models just changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic's vibe shift&lt;/strong&gt;: They're shipping fast and engaging differently. Opus 4.6 with 1M context became the default in Claude Code. They doubled usage limits for two weeks. Small things too, like actual conversations on social. Something changed in how they operate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I don't usually do news-style posts. This time there was enough happening that I wanted to put my honest take on it somewhere.&lt;/p&gt;

&lt;p&gt;Full post: &lt;a href="https://thoughts.jock.pl/p/ai-opinions-march-2026-google-claude-anthropic" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/ai-opinions-march-2026-google-claude-anthropic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Free newsletter on AI agents, automation, and practical experiments: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>My AI Agent Knows Who I Am. Not Just What I Want. Who I Am.</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Fri, 20 Mar 2026 22:04:41 +0000</pubDate>
      <link>https://dev.to/joozio/my-ai-agent-knows-who-i-am-not-just-what-i-want-who-i-am-dhe</link>
      <guid>https://dev.to/joozio/my-ai-agent-knows-who-i-am-not-just-what-i-want-who-i-am-dhe</guid>
      <description>&lt;p&gt;Most AI setups hit a ceiling around month three.&lt;/p&gt;

&lt;p&gt;The agent runs. It completes tasks. But it keeps making the same category of mistakes it made on day one. The tool doesn't compound. It just runs.&lt;/p&gt;

&lt;p&gt;Six months of building my AI agent differently has led to an architecture that actually improves over time. Not because of smarter models. Because of better structure around them. This week's post covers what that structure looks like, what failed before it worked, and one finding from an MIT study that made me uncomfortable.&lt;/p&gt;

&lt;p&gt;Here's what's in it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architecture that broke first.&lt;/strong&gt; A Markdown file called lessons.md. After two weeks and 90 entries, the same mistakes kept recurring. Writing down what went wrong is not the same as fixing it. Obvious in retrospect. Not at the time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta-system monitoring.&lt;/strong&gt; A Python pipeline broke silently. The entire improvement loop ran blind for days. The system looked fine. It wasn't. This failure made monitoring-the-monitors non-negotiable. The current setup runs a 13-point health check at session start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The identity layer.&lt;/strong&gt; There's a meaningful difference between an agent that knows your preferences and one that knows who you are. Preferences are rules: respond concisely, use this email. Identity is deeper: personality type, career situation, energy patterns, what domains you actually know well. Same model. Different profile. Qualitatively different output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The MIT/Penn State sycophancy study.&lt;/strong&gt; Published February 2026. Memory profiles increased agreement sycophancy by 45% in Gemini and 33% in Claude. The more a model knows about you, the more it tells you what you want to hear. I built exactly what the research warns about. And I keep building it. Knowing the cost is step one to managing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can start this today without building an agent.&lt;/strong&gt; Write one page about yourself. Your role, your background, how you process information, what you're actually working on. Paste it at the start of your Claude or ChatGPT sessions. The model doesn't change. What you put in front of it does. Most people never do this, and wonder why the AI keeps explaining things at the wrong level.&lt;/p&gt;

&lt;p&gt;The architecture has been rebuilt three times and will probably be rebuilt again. What compounds isn't the specific implementation. It's the habit of observing, logging, and adjusting.&lt;/p&gt;

&lt;p&gt;Full post: &lt;a href="https://thoughts.jock.pl/p/wiz-ai-agent-self-improvement-architecture" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/wiz-ai-agent-self-improvement-architecture&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentdev</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Gave My AI Agent Its Own Computer. Here's Every Lesson From 72 Hours of Migration.</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Fri, 20 Mar 2026 22:04:35 +0000</pubDate>
      <link>https://dev.to/joozio/i-gave-my-ai-agent-its-own-computer-heres-every-lesson-from-72-hours-of-migration-1jej</link>
      <guid>https://dev.to/joozio/i-gave-my-ai-agent-its-own-computer-heres-every-lesson-from-72-hours-of-migration-1jej</guid>
      <description>&lt;p&gt;I gave my AI agent its own computer. Moving it from my MacBook to a dedicated Mac Mini took 72 hours and broke things I didn't know could break.&lt;/p&gt;

&lt;p&gt;For eight months Wiz ran on my MacBook. It worked, but every time I closed the lid, the agent went offline. Every personal task competed with the agent for compute. The laptop fan ran constantly. I kept thinking: this thing needs its own hardware.&lt;/p&gt;

&lt;p&gt;So I bought a Mac Mini M4 and moved everything. This post is what actually happened.&lt;/p&gt;

&lt;p&gt;Here's what nobody tells you about running an AI agent headless (no monitor attached):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every hardcoded path breaks.&lt;/strong&gt; 340 configuration files, scripts, and settings contained my old username. The agent caught most of them by tracking its own errors. It took two hours of automated find-and-replace and one manual review pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Display is not optional.&lt;/strong&gt; macOS refuses to capture screenshots without a display. Screen sharing, UI automation, and all the browser-based tasks fail silently with no monitor attached. The fix: BetterDisplay creates a virtual display that macOS treats as real. It took four hours to discover this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Networking gets complicated fast.&lt;/strong&gt; Local IP, Tailscale IP, hostname resolution, SSH config, remote access from coffee shops. The Mac Mini sits behind a router with no port forwarding. Tailscale handles the mesh. Now I can SSH in from anywhere.&lt;/p&gt;
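&lt;p&gt;For reference, the client side of that setup is a single &lt;code&gt;~/.ssh/config&lt;/code&gt; entry once Tailscale is running on both machines; the hostname, tailnet name, and user below are placeholders, not the author's values:&lt;/p&gt;

```
# ~/.ssh/config sketch (all names are illustrative)
Host mac-mini
    HostName mac-mini.example-tailnet.ts.net  # Tailscale MagicDNS name
    User agent
    ServerAliveInterval 60                    # keep long sessions alive
```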

&lt;p&gt;&lt;strong&gt;iMessage on a second Apple ID changes how the agent communicates.&lt;/strong&gt; The agent runs as a separate user. That means a separate Apple ID, a separate iCloud, a separate Messages inbox. Setting up two-way communication required custom scripts to bridge the accounts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result is worth it.&lt;/strong&gt; The agent runs 24/7. My laptop is free. The Mac Mini uses about $15/year in electricity. The agent has processed thousands of tasks since the migration with no manual restarts.&lt;/p&gt;
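&lt;p&gt;The electricity figure holds up as a back-of-envelope estimate, assuming roughly 5 W average draw for a mostly idle M4 Mini and $0.30/kWh (both assumptions, not numbers from the post):&lt;/p&gt;

```shell
# Back-of-envelope: yearly energy cost of a mostly idle Mac Mini.
watts=5
kwh_per_year=$(( watts * 24 * 365 / 1000 ))       # 43 kWh
cents_per_kwh=30
dollars=$(( kwh_per_year * cents_per_kwh / 100 )) # about 12 USD
echo "$kwh_per_year kWh/year, about \$$dollars/year"
```

&lt;p&gt;Heavier sustained load or pricier electricity pushes that up, but the order of magnitude is right.&lt;/p&gt;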

&lt;p&gt;72 hours of chaos for a permanently better setup. The full post has every specific fix, every command, every error message and what resolved it.&lt;/p&gt;

&lt;p&gt;Full post: &lt;a href="https://thoughts.jock.pl/p/mac-mini-ai-agent-migration-headless-2026" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/mac-mini-ai-agent-migration-headless-2026&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentdev</category>
      <category>automation</category>
      <category>devlog</category>
    </item>
    <item>
      <title>How I Taught My AI Agent to Think</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Thu, 19 Mar 2026 16:04:41 +0000</pubDate>
      <link>https://dev.to/joozio/how-i-taught-my-ai-agent-to-think-48a4</link>
      <guid>https://dev.to/joozio/how-i-taught-my-ai-agent-to-think-48a4</guid>
      <description>&lt;p&gt;I went from 471 lines of agent instructions to 61. It got better.&lt;/p&gt;

&lt;p&gt;For six months I kept adding rules to my AI agent's CLAUDE.md file. Every time something went wrong, I wrote a rule to prevent it. The file grew. The agent got worse. More instructions created more conflicts, more edge cases, more confusion.&lt;/p&gt;

&lt;p&gt;Deleting 87% of the instructions improved performance. This post covers why that happened and what I learned from rebuilding the system three times.&lt;/p&gt;

&lt;p&gt;Here's what's in it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why less instruction works better than more.&lt;/strong&gt; Specific rules conflict with each other. Principles generalize. I went from 'when the user asks X, do Y' to 'operate autonomously on reversible decisions.' The agent started making better calls with less guidance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The difference between memory and intelligence.&lt;/strong&gt; My agent has four memory layers: working context, persistent memory files, session logs, and reference docs. What I thought was the hard part (which model, which prompts) turned out to matter less than what the agent carries between sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What fails silently and how to catch it.&lt;/strong&gt; Three things broke over six months without me noticing until much later: the feedback loop, the error registry, and the planning system. Each ran for days while appearing to work. The current setup has a 13-point health check at session start.&lt;/p&gt;
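&lt;p&gt;The post doesn't enumerate the 13 checks, but the shape of such a health check is simple; the paths and check names below are illustrative, not the actual list:&lt;/p&gt;

```shell
# Sketch: a session-start health check that prints ok/FAIL per item
# instead of letting subsystems fail silently for days.
check() {
  desc="$1"; shift
  if "$@" 2>/dev/null; then echo "ok   $desc"; else echo "FAIL $desc"; fi
}
run_health_check() {
  check "memory file present"      test -f "$AGENT_HOME/memory.md"
  check "session log dir writable" test -w "$AGENT_HOME/logs"
  check "error registry non-empty" test -s "$AGENT_HOME/errors.log"
}
```

&lt;p&gt;The design point: a FAIL line surfaces at session start, where it actually gets read, rather than in a log nobody opens.&lt;/p&gt;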

&lt;p&gt;&lt;strong&gt;The identity question.&lt;/strong&gt; There's a real difference between an agent that knows your preferences and one that knows who you are. The former gives you a faster version of what you asked for. The latter starts to anticipate what you actually need. I'm still figuring out where that line gets weird.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The sycophancy risk is real.&lt;/strong&gt; An MIT study from February 2026 found memory profiles increased sycophancy by 33-45% in Claude and Gemini. The more the model knows about you, the more it tells you what you want to hear. I built the thing the research warns about. Knowing the risk doesn't fix it. But it changes how I use the output.&lt;/p&gt;

&lt;p&gt;The agent is running better than ever. The instructions file is shorter than a grocery list. Both of those things are true at the same time.&lt;/p&gt;

&lt;p&gt;Full post: &lt;a href="https://thoughts.jock.pl/p/how-i-taught-ai-agent-to-think-ep2" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/how-i-taught-ai-agent-to-think-ep2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentdev</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Gave My AI Agent $25 and Told It to Buy Me a Gift</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Wed, 18 Mar 2026 17:46:02 +0000</pubDate>
      <link>https://dev.to/joozio/i-gave-my-ai-agent-25-and-told-it-to-buy-me-a-gift-3c3d</link>
      <guid>https://dev.to/joozio/i-gave-my-ai-agent-25-and-told-it-to-buy-me-a-gift-3c3d</guid>
      <description>&lt;p&gt;I loaded $25 onto a virtual debit card. Gave it to my AI agent. Simple task: go online and buy me something I'd actually use.&lt;/p&gt;

&lt;p&gt;Five hours. Four major Polish online stores. Zero completed purchases.&lt;/p&gt;

&lt;p&gt;The agent chose the gift perfectly (a fidget slider, knows me well). The hard part was buying it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened at each store:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Allegro&lt;/strong&gt; (Poland's biggest marketplace): Cloudflare detected the headless browser within milliseconds. Instant block.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon.pl&lt;/strong&gt;: No guest checkout. The agent tried reading Apple Keychain credentials. Turns out even with root access, Keychain encryption is hardware-bound to the Secure Enclave. Wall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Empik&lt;/strong&gt; (headless): Got to checkout, Cloudflare Turnstile killed it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Empik&lt;/strong&gt; (real Safari via AppleScript): Browsed products, added to cart, filled shipping, selected delivery. Got 95% through. Then hit a cross-origin payment iframe. Same-origin policy means the agent literally cannot see inside it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every security layer that makes sense for stopping automated fraud also blocks legitimate AI customers.&lt;/p&gt;

&lt;p&gt;The solutions already exist. Shopify launched Agentic Storefronts (AI orders up 11x). Stripe has an Agentic Commerce Suite. Google and Shopify built UCP (Universal Commerce Protocol). But most stores haven't adopted any of it.&lt;/p&gt;

&lt;p&gt;I built a free tool that scores any store on 12 AI readiness criteria. Most stores land in the C-D range. The gap between "we have an online store" and "AI agents can shop here" is massive.&lt;/p&gt;

&lt;p&gt;Try it: &lt;a href="https://wiz.jock.pl/experiments/ai-shopping-checker" rel="noopener noreferrer"&gt;https://wiz.jock.pl/experiments/ai-shopping-checker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Full writeup with all the technical details, the solutions, and what stores should do now: &lt;a href="https://thoughts.jock.pl/p/ai-agent-shopping-experiment-real-money-2026" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/ai-agent-shopping-experiment-real-money-2026&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ecommerce</category>
      <category>webdev</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
