<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pawel Jozefiak</title>
    <description>The latest articles on DEV Community by Pawel Jozefiak (@joozio).</description>
    <link>https://dev.to/joozio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3805838%2F26cb0821-19c4-4d0c-a0df-b0a8e75e3a0d.png</url>
      <title>DEV Community: Pawel Jozefiak</title>
      <link>https://dev.to/joozio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joozio"/>
    <language>en</language>
    <item>
      <title>How to Build Your First AI Agent (Basics). Full Package</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Thu, 16 Apr 2026 11:14:55 +0000</pubDate>
      <link>https://dev.to/joozio/how-to-build-your-first-ai-agent-basics-full-package-512k</link>
      <guid>https://dev.to/joozio/how-to-build-your-first-ai-agent-basics-full-package-512k</guid>
<description>&lt;h1&gt;How to Build Your First AI Agent (Basics)&lt;/h1&gt;

&lt;p&gt;Six months of mistakes, a real walk-through, and everything I wish someone had told me before I started.&lt;/p&gt;

&lt;p&gt;I've been building my own AI agent since October. Every mistake you can make on a first build, I've made. Some of them twice.&lt;/p&gt;

&lt;p&gt;A few days ago I asked my readers what I should write about for beginners. The answers lined up surprisingly cleanly. Almost everyone asked for the same thing in different words: the real stuff. What actually goes wrong. What to do on day one. How to start without feeling lost.&lt;/p&gt;

&lt;p&gt;So here it is. More structured than my usual posts, because this one is for people starting from zero. If you already have an agent running, most of this will still be useful, but the mental model is written for someone who's never done this before.&lt;/p&gt;

&lt;p&gt;One thing before we start. Mistakes aren't failure. For early adopters, they ARE the job. Everyone building in this space is hitting the same walls at the same time, because nobody has the map yet. You're not doing it wrong. You're doing it at all, which is the hard part.&lt;/p&gt;

&lt;h2&gt;1. What is an AI agent, really (and why it's different from automation)&lt;/h2&gt;

&lt;p&gt;My starting point wasn't AI. It was Zapier.&lt;/p&gt;

&lt;p&gt;I've been building classical automations for years. Zapier, n8n, make.com, custom scripts, connectors glued together with duct tape. When I started thinking about building my own agent back in October, my first instinct was to do exactly what I knew: chain tools together with a workflow builder and call it a day. I actually started that way.&lt;/p&gt;

&lt;p&gt;Honestly, for a lot of people reading this, that's still a perfectly reasonable starting point. If you've never built any kind of automation before, go make three Zaps this week. Connect your calendar to Notion. Send yourself a Slack message when an RSS feed updates. Do something small and stupid. Feel how a &lt;em&gt;trigger&lt;/em&gt; leads to an &lt;em&gt;action&lt;/em&gt; which leads to a &lt;em&gt;result&lt;/em&gt;. Those three concepts are the spine of everything that comes next.&lt;/p&gt;

&lt;p&gt;The reason I didn't stop at Zapier is the difference between an automation and an agent. An automation is deterministic. Same input, same steps, same output. You define every branch in advance. It's predictable, which is why it's trustworthy for production work.&lt;/p&gt;

&lt;p&gt;An agent has wiggle room. You give it a goal and a set of tools, and it decides how to use them. Given the same input twice, it might do slightly different things. It might also do something you didn't anticipate, because the whole point is that it can improvise. Although that sounds risky (and sometimes it is), it's also the thing that makes an agent valuable. If the tool it expected is broken, it can find a workaround or build one. A classic automation just stops.&lt;/p&gt;

&lt;p&gt;Neither one is better. They solve different problems. And honestly, most production "agents" out there are closer to classic automations with a language model glued to the top. That's fine. It works. What matters is you know which one you're building, because the failure modes are completely different.&lt;/p&gt;
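&lt;p&gt;If the distinction is easier to see in code, here's a minimal sketch. Everything in it is illustrative (the tool name, the step cap, and pick_next_step as a stand-in for a language-model call); it only shows the shape of the two things: a fixed sequence versus a loop where the model chooses the next move.&lt;/p&gt;

```python
# Illustrative only: the tool names and step cap are made up, and
# pick_next_step stands in for a language-model call.

def automation(email_text):
    # Deterministic: same input, same steps, same output, every time.
    return "Summary request: " + email_text.strip()[:100]

def agent(goal, tools, pick_next_step):
    # The model (pick_next_step) chooses which tool to run next,
    # or says "done". The same input can take different paths.
    history = []
    for _ in range(10):  # hard cap so a confused agent can't loop forever
        step = pick_next_step(goal, history, list(tools))
        if step == "done":
            break
        history.append(tools[step]())
    return history
```

&lt;p&gt;Notice that the automation has no decision point at all, while the agent's entire body is a decision point. That's the whole difference, and it's why their failure modes diverge so completely.&lt;/p&gt;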

&lt;h2&gt;2. Three questions I had to answer the long way around&lt;/h2&gt;

&lt;p&gt;Before we touch any code, I want to borrow a framing from Zachary Wefel, who left one of the best comments on my original note. He pointed out that writers in tech tend to skip past the most basic things about how software actually exists in the world, because people around them already assume those things. He gave three questions as an example:&lt;/p&gt;

&lt;p&gt;Where does the agent live? How do you see it? How do you talk to it?&lt;/p&gt;

&lt;p&gt;I had to answer all three for myself, and I took the long way around on all of them. Here's what I learned.&lt;/p&gt;

&lt;h3&gt;Where does it live?&lt;/h3&gt;

&lt;p&gt;Mine lives on a Mac Mini next to the main TV in my living room. Before that it lived on my personal MacBook for the first few months, which was fine except I needed my laptop to be on all the time for anything to run. Eventually that got annoying enough that I &lt;a href="https://thoughts.jock.pl/p/mac-mini-ai-agent-migration-headless-2026" rel="noopener noreferrer"&gt;moved it to its own dedicated machine&lt;/a&gt;. That's not a day-one problem.&lt;/p&gt;

&lt;p&gt;For your first agent, the answer is: it lives on your laptop. That's it. Your laptop is enough. An agent is just software. It lives wherever that software runs. That can be your laptop, a cheap dedicated computer in your closet, a rented cloud server, or a Raspberry Pi. Don't complicate this before you have anything running.&lt;/p&gt;

&lt;h3&gt;How do you see it?&lt;/h3&gt;

&lt;p&gt;You mostly won't. There's usually no dashboard, no slick interface, no moving dials. This confuses a lot of beginners, because we're used to software having a face.&lt;/p&gt;

&lt;p&gt;You "see" an agent through what it produces. Files it writes. Messages it sends you. Things it prints in the terminal. Tasks it finishes or fails at. You can build a dashboard later if you want one (I eventually did), but on day one the agent is invisible except for its outputs.&lt;/p&gt;

&lt;h3&gt;How do you talk to it?&lt;/h3&gt;

&lt;p&gt;My agent has four channels now: email, Discord, iMessage, and a task app I built for it called WizBoard. That's way more than a beginner needs. You need &lt;em&gt;one&lt;/em&gt; channel, and whatever you already use for anything else is a fine pick.&lt;/p&gt;

&lt;p&gt;The easiest first channel is the terminal on your own laptop. You type a message. It responds. That's the whole interface. It looks ugly. It's also the most powerful setup you can have for learning, because every other interface is just a fancy wrapper around that same loop.&lt;/p&gt;

&lt;h2&gt;3. What you need to begin&lt;/h2&gt;

&lt;p&gt;Before any code, before any chat, here's the kit.&lt;/p&gt;

&lt;h3&gt;3.1. A machine&lt;/h3&gt;

&lt;p&gt;Your laptop is fine. Any laptop. Mac, Linux, Windows, all fine. If it can run a browser and a text editor, it can run your first agent. Don't buy anything new.&lt;/p&gt;

&lt;p&gt;Later on, if you want your agent to keep working while you sleep or while you're away from your desk, you'll eventually graduate to something that stays on. I wrote about &lt;a href="https://thoughts.jock.pl/p/mac-mini-ai-agent-migration-headless-2026" rel="noopener noreferrer"&gt;what that migration looked like for me&lt;/a&gt;, and it wasn't hard. Although it matters eventually, it's a month-three problem, not a day-one problem.&lt;/p&gt;

&lt;h3&gt;3.2. A subscription (or API access)&lt;/h3&gt;

&lt;p&gt;Let me be direct about this part, because I don't see it spelled out often enough in beginner guides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tiers aren't enough.&lt;/strong&gt; They cap you out fast, and you'll spend your first afternoon hitting rate limits instead of learning. This is the wrong place to save money.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A $20 per month tier is your floor.&lt;/strong&gt; Claude Pro, ChatGPT Plus, or the equivalent from whichever provider you pick. That tier is genuinely enough to build a simple first agent and get it working. You won't love it forever, but it's more than enough to start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Power users run more than that.&lt;/strong&gt; I pay for multiple subscriptions and for API usage on top. My bill isn't small. That's a months-from-now problem. Don't worry about it yet.&lt;/p&gt;

&lt;p&gt;Think of the $20 as a gym membership. It's the cost of learning the skill. And honestly, it's one of the cheapest upgrades to your toolkit you'll ever make, so don't flinch at it.&lt;/p&gt;

&lt;h3&gt;3.3. A harness (the tool you actually work with)&lt;/h3&gt;

&lt;p&gt;"Harness" is the word I use for the tool you sit in front of while building. There are four honest options, and all of them work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Claude Code.&lt;/strong&gt; A terminal-based tool from Anthropic. This is what I use most days. Deep file access, built for serious building. Power user territory, but approachable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Claude Cowork.&lt;/strong&gt; Also from Anthropic. A built-in cloud app that runs Claude in an agent loop without you ever touching a terminal. If the word "terminal" already makes you nervous, this is probably where you should start. It's genuinely good enough to build your first real agent in, and you can always graduate to Claude Code later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; (or the equivalent from another provider). Same category as Claude Code, different flavor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A plain AI chat&lt;/strong&gt; like Claude.ai or ChatGPT in your browser. Yes, you can genuinely start here. You'll be copy-pasting more, but it works completely.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pick one. Don't spend a week comparison-shopping. The differences don't matter until you've actually built something and know what you need. I wrote a longer piece on &lt;a href="https://thoughts.jock.pl/p/claude-code-source-leak-what-to-learn-ai-agents-2026" rel="noopener noreferrer"&gt;what's actually worth learning from a harness like Claude Code&lt;/a&gt; if you want a deeper take. But for today, pick one and move on.&lt;/p&gt;

&lt;h3&gt;3.4. A folder (this is THE architecture)&lt;/h3&gt;

&lt;p&gt;Here's the mental model that took me three months to see clearly. If you take it seriously, it'll save you those three months.&lt;/p&gt;

&lt;p&gt;The architecture of your AI agent IS its folder structure.&lt;/p&gt;

&lt;p&gt;That's it. There is no hidden magic layer. Every functional piece of an AI agent lives as a file in a folder on your computer. When someone online says "the agent has tools," what they really mean is: there are scripts in a folder that the agent knows how to run. When someone says "the agent has memory," they mean: there are markdown files it reads at the start of each session. When someone says "the agent has an instruction set," they mean: there's a file called something like CLAUDE.md or agents.md that tells it who it is and what the rules are.&lt;/p&gt;

&lt;p&gt;It's all files. That's the whole trick. Once you see the folder as the architecture, the mystery goes away.&lt;/p&gt;

&lt;p&gt;Here's what a beginner's agent folder looks like in practice:&lt;/p&gt;

&lt;p&gt;my-agent/&lt;br&gt;
├── CLAUDE.md              ← instructions (the brain)&lt;br&gt;
├── memory/&lt;br&gt;
│   └── notes.md           ← what the agent remembers&lt;br&gt;
├── projects/&lt;br&gt;
│   └── morning-email/&lt;br&gt;
│       ├── fetch-email    ← the part that pulls your email&lt;br&gt;
│       └── prompt.md      ← how you want it summarized&lt;br&gt;
├── scripts/               ← small helper scripts&lt;br&gt;
└── secrets/               ← API keys, passwords (keep this safe)&lt;/p&gt;

&lt;p&gt;Read that tree slowly. Every concept maps cleanly to a file or folder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instructions&lt;/strong&gt; live in CLAUDE.md or agents.md depending on your harness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt; lives in markdown files inside memory/.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt; (what the agent can do) are scripts inside scripts/ or inside each project folder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Projects&lt;/strong&gt; live as subfolders under projects/.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Credentials&lt;/strong&gt; (passwords, API keys) live in a protected secrets/ folder.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you look at an AI agent this way, it stops being a mysterious entity and starts being something very familiar: a folder with text files in it. I wrote about &lt;a href="https://thoughts.jock.pl/p/how-i-structure-claude-md-after-1000-sessions" rel="noopener noreferrer"&gt;how I structure the CLAUDE.md file itself after more than a thousand sessions&lt;/a&gt;, and that file is the single most important thing you will own. For now, just sit with the idea: the whole agent is a folder.&lt;/p&gt;
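&lt;p&gt;To make that concrete, here's what a first CLAUDE.md might look like. Every line of it is a placeholder to rewrite in your own words (step 3 of the build below has the AI draft one for you), but it shows how little is actually inside the "brain":&lt;/p&gt;

```markdown
# CLAUDE.md — illustrative skeleton; rewrite every line in your own words

## Who you are
You are my personal agent. You live in this folder and work only inside it.

## Rules
- Read memory/notes.md at the start of each session.
- Never touch anything in secrets/ except to read a credential a task needs.
- Ask me before sending anything on my behalf.

## Projects
Each subfolder of projects/ is one job. Its prompt.md says how I want it done.
```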

&lt;h2&gt;4. Build your first agent, step by step&lt;/h2&gt;

&lt;p&gt;Enough theory. I want you to finish this post with a real working agent, not just an understanding. I'm going to walk through the exact project I recommend for a first build: &lt;em&gt;an agent that reads your overnight email and writes you a one-paragraph morning summary.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I picked this one on purpose. It's small enough to finish in an afternoon. It's real enough that you'll actually use it tomorrow. And it'll make you hit most of the real challenges in building any agent: authentication, permissions, context, prompt design, error handling. You'll learn more from building this than from reading any number of articles about it.&lt;/p&gt;

&lt;h3&gt;Step 1. Decide what you want (fifteen minutes, no code)&lt;/h3&gt;

&lt;p&gt;Open your chat tool of choice. Not to write code yet. Just to think out loud. Describe your morning:&lt;/p&gt;

&lt;p&gt;Every morning I open my email. I scan 40 messages. I figure out which three actually matter. I want a one-paragraph summary of the important stuff before my coffee is done.&lt;/p&gt;

&lt;p&gt;That's your spec. Keep it this short. If you can't explain what you want in one honest paragraph, you don't understand what you want yet, and the agent isn't going to save you from that. Better to figure it out before you write a line of code.&lt;/p&gt;

&lt;h3&gt;Step 2. Create the folder (five minutes)&lt;/h3&gt;

&lt;p&gt;Make an empty folder on your computer. Call it my-agent. Inside it, create the skeleton:&lt;/p&gt;

&lt;p&gt;my-agent/&lt;br&gt;
├── CLAUDE.md&lt;br&gt;
├── memory/&lt;br&gt;
├── projects/morning-email/&lt;br&gt;
├── scripts/&lt;br&gt;
└── secrets/&lt;/p&gt;

&lt;p&gt;Empty folders are fine. We'll fill them as we go. The only reason to make them now is so your agent has a place to put things.&lt;/p&gt;
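&lt;p&gt;If you'd rather script it than click around in a file manager, a few lines do the same thing. This just mirrors the tree above, nothing more:&lt;/p&gt;

```python
# Creates the Step 2 skeleton. Folder names mirror the tree above.
from pathlib import Path

root = Path("my-agent")
for sub in ["memory", "projects/morning-email", "scripts", "secrets"]:
    (root / sub).mkdir(parents=True, exist_ok=True)

# Empty for now; Step 3 is where this file gets its first contents.
(root / "CLAUDE.md").touch()
```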

&lt;h3&gt;Step 3. Let the AI draft your instructions file (ten minutes)&lt;/h3&gt;

&lt;p&gt;If you're using Claude Code, there's an even shorter way to start. From inside your empty my-agent folder, run the /init command. Claude Code looks around, figures out what it's dealing with, and drops an initial CLAUDE.md in there for you. That's your starting point. One command, done.&lt;/p&gt;

&lt;p&gt;If you're in a different harness or a plain chat, type something like:&lt;/p&gt;

&lt;p&gt;I want to build an AI agent whose first job is to read my email inbox every morning and write me a one-paragraph summary of what matters. Draft a CLAUDE.md instructions file for it. Keep it under 50 lines. Don't assume anything about my setup.&lt;/p&gt;

&lt;p&gt;Either way, you'll end up with a file called CLAUDE.md inside your folder. That's the starting version. It will be rough. That's fine.&lt;/p&gt;

&lt;h3&gt;Step 4. READ the CLAUDE.md (this is the most important step in this entire post)&lt;/h3&gt;

&lt;p&gt;I'm not joking. This one step is worth more than the other seven combined.&lt;/p&gt;

&lt;p&gt;Open the file the AI just wrote. Read every line. Ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Does this actually describe what I want?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Are there weird assumptions baked in that I didn't ask for?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does the voice sound like me, or like corporate blog filler?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is there anything in here that surprises me?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Edit it until it reads like &lt;em&gt;you&lt;/em&gt; wrote it. Remove anything you don't understand. Add anything the model forgot. This file is the brain of your agent. If it's wrong, every single thing downstream of it will also be wrong, and you'll spend hours later chasing a ghost that started right here on day one. More on why in the mistakes section.&lt;/p&gt;

&lt;h3&gt;Step 5. Tell it what to automate (around thirty minutes)&lt;/h3&gt;

&lt;p&gt;Now the actual building. Here's the key thing to understand, and it's the reason I'm not writing out a bunch of code for you to copy: you don't have to. You can just describe what you want in plain language, and the harness will figure out the rest.&lt;/p&gt;

&lt;p&gt;Back to your harness. Say something like:&lt;/p&gt;

&lt;p&gt;I want the first thing in projects/morning-email to read my email inbox, pull the last 12 hours of unread messages, and hand them off to be summarized. The end result should be a one-paragraph summary of what actually matters. Figure out the best way to do this on my setup and walk me through it step by step.&lt;/p&gt;

&lt;p&gt;That's it. That's the entire prompt. No code, no jargon, no pretending you know what a shell script is.&lt;/p&gt;

&lt;p&gt;A good harness (and all of the options above qualify these days) will then ask you follow-up questions. What email provider do you use? Mac, Windows, or Linux? Do you already have API credentials? Do you want this to run on a schedule, or only when you ask for it? It'll figure out the right tool for the job and explain each step as it goes. You just answer the questions honestly.&lt;/p&gt;

&lt;p&gt;This is the real difference between working with an agent and writing code from scratch. You're not supposed to know in advance what tool or file format or library it's going to use. That's its job. Your job is to know what you want and to check the output when it lands.&lt;/p&gt;
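&lt;p&gt;For a feel of what the harness will produce for the boring part, here's the shape of the "last 12 hours of unread" filter as plain, non-AI code. The real fetch depends entirely on your provider (IMAP, a Gmail API, whatever the harness picks), so the message format here is a made-up stand-in:&lt;/p&gt;

```python
# Illustrative shape of the non-AI fetch step. The message dicts are a
# made-up stand-in for whatever your provider actually returns.
from datetime import datetime, timedelta, timezone

def recent_unread(messages, hours=12, now=None):
    """messages: dicts with 'received' (aware datetime) and 'unread' (bool)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    return [m for m in messages if m["unread"] and m["received"] >= cutoff]
```

&lt;p&gt;Nothing in there needs a language model, which is exactly the point of the next step.&lt;/p&gt;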

&lt;h3&gt;Step 6. Let it build, but put the AI call at the END of the pipeline&lt;/h3&gt;

&lt;p&gt;While your harness is building, there's one thing to steer. This might be the biggest efficiency lesson in the whole post: &lt;strong&gt;AI doesn't belong in every step of the pipeline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your agent is going to fetch email. Fetching email is a problem boring, non-AI code has solved for 30 years. You don't need a language model for that part. The only part that actually needs a language model is the summarizing, because that's the part that requires understanding the content.&lt;/p&gt;

&lt;p&gt;So tell the harness explicitly:&lt;/p&gt;

&lt;p&gt;Keep AI out of the fetch step. Use whatever normal tool is appropriate there. Only use the language model at the very end, for the summarization itself. One call total, not one per email.&lt;/p&gt;

&lt;p&gt;It'll handle this correctly if you ask for it. Usually it won't volunteer to do it this way, because stuffing an LLM into every step feels more impressive and uses more tokens. You'll thank yourself later. I wrote a whole piece on &lt;a href="https://thoughts.jock.pl/p/automation-guide-2025-ten-rules-when-to-automate" rel="noopener noreferrer"&gt;when to use AI and when to just use normal code&lt;/a&gt;, and the rule from that post applies directly here: use AI where judgment or language actually matters, and use plain tools for everything else.&lt;/p&gt;
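&lt;p&gt;The pipeline shape that instruction asks for looks like this in sketch form. The summarize argument is a placeholder for whatever single model call your harness wires up; everything before it is plain string work:&lt;/p&gt;

```python
# Sketch of "AI only at the end": plain code everywhere, then exactly
# one model call. summarize() is a placeholder for the real API call.

def build_prompt(emails):
    # Plain string work, no AI: pack every email into ONE prompt.
    body = "\n\n---\n\n".join(
        f"From: {e['from']}\nSubject: {e['subject']}\n{e['body']}"
        for e in emails
    )
    return "Summarize what actually matters, in one paragraph:\n\n" + body

def morning_summary(emails, summarize):
    if not emails:
        return "Nothing new overnight."  # no model call needed at all
    return summarize(build_prompt(emails))  # one call total, not one per email
```

&lt;p&gt;The design choice worth noticing: batching every email into a single prompt means your token cost stays roughly flat whether you got five messages or forty.&lt;/p&gt;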

&lt;h3&gt;Step 7. Run it (five minutes)&lt;/h3&gt;

&lt;p&gt;Now run the thing you just built. There are two honest ways to do this, depending on how comfortable you are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The non-technical way:&lt;/strong&gt; just ask your agent to run it for you. In Claude Code, Claude Cowork, or Codex, you can literally say "run my morning email agent" and it'll execute the thing it just built and show you the result. This is the easiest path if you're not comfortable in a terminal. It works. Use it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The technical way:&lt;/strong&gt; if you like knowing exactly what's happening, ask the harness "what command do I run to execute this myself?" and it'll give you the one-liner to paste into your terminal. Then you're running it directly, no agent in the loop.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Either way, you should see your morning summary print out. If you see it, you just built an AI agent. Congratulations. Go make coffee.&lt;/p&gt;

&lt;h3&gt;Step 8. When it breaks (this is where the real learning is)&lt;/h3&gt;

&lt;p&gt;It will break. Something won't authenticate, or the summary will be garbage, or it'll pull emails from the wrong time window. Good. This is the part you can't skip, and it's where the actual learning happens.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Read the error literally. Don't panic. Paste the whole thing back into your harness and ask it to explain what happened and what to try next.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the behavior keeps drifting from what you want, the problem is almost always in CLAUDE.md. Go back and fix the instructions there first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the summary is the wrong shape or tone, fix the summarization prompt.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If no data is coming through at all, the problem is earlier in the pipeline, and the agent can usually diagnose this for you in two or three back-and-forths.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. You have a real agent now. It's small, it's yours, and it does one thing you actually care about. Everything else in the rest of this post is about what will bite you as you grow it into something bigger.&lt;/p&gt;

&lt;h2&gt;5. The mistakes I made (so you can skip them)&lt;/h2&gt;

&lt;p&gt;This is the section my readers asked for the loudest. Opinion AI, who left the top comment on my original note, put it better than I could:&lt;/p&gt;

&lt;p&gt;Would love to see you cover the mistakes people make on their first agent build. The "what not to do" part is always more useful than the setup guide, and almost nobody writes about it.&lt;/p&gt;

&lt;p&gt;Agreed. Here are the ones I actually hit.&lt;/p&gt;

&lt;h3&gt;Mistake 1. Trusting the AI blindly to write your instructions file&lt;/h3&gt;

&lt;p&gt;Back in October, I was in a hurry. I let the AI generate my first CLAUDE.md and didn't read it carefully. I ran with it. Things worked, sort of. Then the agent started doing weird things I hadn't asked for. Small weirdness at first. Then bigger.&lt;/p&gt;

&lt;p&gt;I spent hours, maybe days, chasing ghosts. Poking at different parts of the architecture. Swapping tools. Adjusting prompts. Burning billions of tokens trying to figure out what was happening. The root cause turned out to be a single misguided sentence near the top of the instructions file that I hadn't bothered to read on day one.&lt;/p&gt;

&lt;p&gt;The rule is simple and I'll repeat it because it matters: &lt;strong&gt;you can use AI to generate your instructions. You can't skip reading them. Ever.&lt;/strong&gt; Read every line at least once. Edit until it sounds like you wrote it.&lt;/p&gt;

&lt;h3&gt;Mistake 2. Letting self-improvement run wild on the core files&lt;/h3&gt;

&lt;p&gt;Some time later, I built a self-improving layer. The agent could look at its own behavior, notice patterns, and update its own instructions. Technically brilliant. I was proud of it.&lt;/p&gt;

&lt;p&gt;I also forgot to tell it which files it was allowed to touch.&lt;/p&gt;

&lt;p&gt;Within a few days it had rewritten large parts of the core CLAUDE.md in ways I'd never sanctioned. The agent started drifting in five directions at once. Things I had explicitly told it to do were getting silently overwritten by its own "improvements." Although I was proud of the self-improvement layer as an idea, I had to roll a lot of it back and rebuild it from scratch.&lt;/p&gt;

&lt;p&gt;The fix was about scope. Each project in my agent now has its own small instruction file and its own little memory file. When self-improvement runs, it touches those leaf files, not the core. The trunk stays protected. The branches can grow. I eventually wrote a longer piece on &lt;a href="https://thoughts.jock.pl/p/wiz-ai-agent-self-improvement-architecture" rel="noopener noreferrer"&gt;the full self-improvement architecture&lt;/a&gt; if you want the deep version. For a beginner, the takeaway is simpler: never let any automated process write directly to the core instructions file. Ever.&lt;/p&gt;
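&lt;p&gt;The scoping rule can be enforced with something as small as an allowlist check before any automated write. This is a sketch of the idea, not my actual implementation; the paths follow the folder tree from earlier:&lt;/p&gt;

```python
# Sketch: gate every self-improvement write behind an allowlist.
# Paths follow the example folder tree; adjust to your own layout.
from pathlib import Path

PROTECTED = {Path("CLAUDE.md")}                      # the trunk: never auto-edited
WRITABLE_DIRS = [Path("projects"), Path("memory")]   # the leaves: fair game

def may_auto_edit(path):
    path = Path(path)
    if path in PROTECTED:
        return False
    # Allowed only if the file sits somewhere under a writable directory.
    return any(d in path.parents for d in WRITABLE_DIRS)
```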

&lt;h3&gt;Mistake 3. Ignoring open source out of pride&lt;/h3&gt;

&lt;p&gt;I wanted to build the whole thing myself. I refused to look at what other people were doing on GitHub. I told myself I didn't want to be influenced.&lt;/p&gt;

&lt;p&gt;That cost me two or three months.&lt;/p&gt;

&lt;p&gt;Around month three I finally caved and started reading other people's agent repos. Not to copy the architecture (which usually wouldn't fit anyway), but to steal &lt;em&gt;concepts&lt;/em&gt;. One example: I found a file called SOUL.md in an open source project. I'd only been using CLAUDE.md at that point, trying to cram every aspect of the agent into one file. SOUL.md turned out to be a dedicated place for personalization: values, voice, what the agent is &lt;em&gt;like&lt;/em&gt; as a personality. That small idea opened up a whole layer for me that I'd been clumsily stuffing into the main instructions. I was a better agent designer the day after I read it than I was the day before.&lt;/p&gt;

&lt;p&gt;Bianca Schulz asked about open source frameworks in the comments on my note, and here's the honest answer: read them, borrow concepts, don't feel obligated to adopt any single one of them wholesale. Your agent doesn't need to look like anyone else's. But you should know what the good ones are doing.&lt;/p&gt;

&lt;h3&gt;Mistake 4. Using the strongest model for every single task&lt;/h3&gt;

&lt;p&gt;For a long time I was running Opus on everything. Every small query. Every file read. Every trivial check. I'd hit my usage limit before lunch and then panic.&lt;/p&gt;

&lt;p&gt;The fix is something I now call model routing, and it cut my usage dramatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fast and simple stuff&lt;/strong&gt; goes to a small model, often a local LLM now. Before that I was using Haiku.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;General work, planning, most coding&lt;/strong&gt; goes to a mid-tier model. For me that's Sonnet 4.6. This is where most of the work happens.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hard reasoning, critical code, strategic decisions&lt;/strong&gt; go to Opus 4.6.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wrote in detail about &lt;a href="https://thoughts.jock.pl/p/claude-model-optimization-opus-haiku-ai-agent-costs-2026" rel="noopener noreferrer"&gt;why this switch made the agent both cheaper and better&lt;/a&gt;. Short version: nobody is going to optimize your usage for you. You have to do it yourself, and you should do it earlier than I did.&lt;/p&gt;
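&lt;p&gt;Mechanically, model routing can be as small as a lookup table. The tier names below mirror the list above, and the model names are placeholders for whatever your provider calls its small, mid, and top tiers. The genuinely hard part, classifying a task into a tier, is the rule set you build up over time:&lt;/p&gt;

```python
# Minimal routing sketch. Model names are placeholders, not real IDs;
# swap in your provider's actual small/mid/top models.
ROUTES = {
    "simple": "small-local-model",  # quick checks, formatting, triage
    "general": "mid-tier-model",    # planning, most coding
    "hard": "top-tier-model",       # critical reasoning, strategy
}

def pick_model(tier):
    # Unknown tiers fall back to the mid-tier workhorse, not the top.
    return ROUTES.get(tier, ROUTES["general"])
```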

&lt;h3&gt;Mistake 5. Trying to build Jarvis on day one&lt;/h3&gt;

&lt;p&gt;If I'm being completely honest, my original fantasy was Jarvis from Iron Man. One agent that solved everything, ran my whole life, handled the business, wrote the blog, managed the calendar, raised the kid. The whole thing. From day one.&lt;/p&gt;

&lt;p&gt;That was the real mistake, and basically everything else downstream of it was a consequence. I started with expectations that were impossible to meet in week one, so I kept pushing the architecture too hard and too fast. I'd add five features at once when I should've added one and let it settle. Although I did get a fully autonomous version working eventually, I had to roll a lot of it back.&lt;/p&gt;

&lt;p&gt;The version that actually works, the one I have now, is the one I should've been building from the start: incremental. One small task. Then the next. Then the next. The big Jarvis-like thing did emerge eventually, but as a side effect of building a hundred small working pieces, not as a top-down design.&lt;/p&gt;

&lt;p&gt;Full autonomy without taste isn't really what you want, either. The problem with a fully autonomous agent isn't that it can't do things. It's that it has no way of knowing whether the thing it just produced is actually good, because the thing that decides "good" is usually you. Your standards. Your instincts. Your sense of what's off.&lt;/p&gt;

&lt;p&gt;My agent is still autonomous for a large set of predictable tasks: morning reports, evening summaries, urgent flags, inbox triage, some experiments. Anything where the shape of "good" is well-defined. For anything creative, strategic, or quality-sensitive, I'm firmly in the loop.&lt;/p&gt;

&lt;p&gt;Think of an agent as a partner, not a solver. And don't try to build Jarvis on day one. Build one small, honest thing that works, then build the next one on top of it. That's the only order of operations that actually converges.&lt;/p&gt;

&lt;h3&gt;Mistake 6. Putting AI in every step of every pipeline&lt;/h3&gt;

&lt;p&gt;Early on, every single thing my agent did had a language model call somewhere in it. Fetching data. Moving files. Routing messages. Formatting output. LLM everywhere, because LLMs felt magical and I wanted to use them for everything.&lt;/p&gt;

&lt;p&gt;One morning I noticed I was at 50% of my 5-hour usage window before I'd actually done any real work. Just from the agent's own background tasks waking up.&lt;/p&gt;

&lt;p&gt;The fix was boring and obvious in hindsight: &lt;strong&gt;most of a pipeline can be a plain script.&lt;/strong&gt; Move data from A to B with a script. Call the model exactly once, at the end, for the one thing that actually requires language. That's what the model is for. Everything before that is plumbing, and plumbing should be code.&lt;/p&gt;

&lt;p&gt;AI isn't free. Even local models cost time, electricity, and capacity. You don't need AI everywhere. You need it where the language or the judgment actually matters.&lt;/p&gt;

&lt;h3&gt;Mistake 7. Forgetting that your harness updates constantly&lt;/h3&gt;

&lt;p&gt;Claude Code updates almost daily. Codex updates often. Every harness does. This is mostly a good thing, except for one small catch: features you built from scratch will sometimes get shipped natively by the tool you're building on, and now you have the same thing twice. Your custom version and the new native version start fighting each other, and the output drifts in ways that are hard to diagnose.&lt;/p&gt;

&lt;p&gt;My fix was a small automation that checks for updates every day and flags anything in my custom code that overlaps with new native features. When it finds one, I delete my version and use the native one. Cleaner, less code to maintain, better integration.&lt;/p&gt;

&lt;p&gt;If you don't do something like this, after a few weeks you'll notice things wiggling and conflicting and you won't know why. The harness moved under your feet. It's the cost of building on a fast-moving platform, and you just have to pay attention to it.&lt;/p&gt;
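&lt;p&gt;My actual automation is more involved, but the core idea fits in a few lines. This is a hypothetical sketch (the feature names and keywords are made up): keep keywords for each custom feature you've built, and flag any that show up in the harness's latest release notes.&lt;/p&gt;

```python
# Hypothetical custom features mapped to keywords that would signal
# a native replacement shipping in the harness.
CUSTOM_FEATURES = {
    "my-session-memory": ["memory", "persistent context"],
    "my-auto-commit": ["auto commit", "checkpoint"],
}

def flag_overlaps(release_notes):
    """Return the custom features that the latest release notes
    appear to duplicate, so you can delete your version."""
    notes = release_notes.lower()
    return sorted(
        name for name, keywords in CUSTOM_FEATURES.items()
        if any(kw in notes for kw in keywords)
    )

print(flag_overlaps("v2.3: native checkpointing with auto commit support"))
```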

&lt;h3&gt;
  
  
  Mistake 8. Installing skills from a marketplace without checking them
&lt;/h3&gt;

&lt;p&gt;This one is newer, because skill marketplaces and shareable agent extensions are newer. Claude Code now has a growing ecosystem of skills you can drop into your agent. Other harnesses have similar things. The idea is great: someone else already solved a problem you have, you install their skill, you save hours.&lt;/p&gt;

&lt;p&gt;The catch is that a skill is code that runs on your machine with your agent's permissions. If you install one without understanding what it does, you've effectively given a stranger a seat at the table inside your setup. Most skills are fine. Some aren't. I already wrote about &lt;a href="https://thoughts.jock.pl/p/claude-skill-auditor-security-scanner-claude-code-2026" rel="noopener noreferrer"&gt;a case where malware was hidden inside a Claude Code skill&lt;/a&gt;, which is why I built a scanner for them in the first place.&lt;/p&gt;

&lt;p&gt;The rule I follow now, and the one I'd give you from day one: before installing any skill from any marketplace, ask yourself two questions. &lt;strong&gt;Do I actually need this, or am I installing it because it's there?&lt;/strong&gt; And &lt;strong&gt;do I understand, at least roughly, what it's allowed to do?&lt;/strong&gt; If you can't answer both, don't install it yet. Read its source. Ask your agent to walk you through what it does. Treat it like any piece of software from someone you've never met, because that's what it is.&lt;/p&gt;
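&lt;p&gt;If you want something more concrete than "read its source," even a crude pattern scan helps. This is a toy heuristic (my own pattern list, nowhere near a real audit): flag any line in a skill's source that touches the network, the shell, or secrets, then read those lines first.&lt;/p&gt;

```python
import re

# Toy pattern list (mine, not any audit standard).
RISKY = {
    "network call": re.compile(r"requests\.|urllib|socket\."),
    "shell execution": re.compile(r"subprocess|os\.system"),
    "secret access": re.compile(r"os\.environ|api[_-]?key", re.IGNORECASE),
}

def audit_source(source):
    """Return (line number, label, line) for every risky-looking line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for label, pattern in RISKY.items():
            if pattern.search(line):
                findings.append((lineno, label, line.strip()))
    return findings

sample = "import subprocess\ntoken = os.environ['MY_KEY']\n"
for lineno, label, line in audit_source(sample):
    print(f"line {lineno}: {label} -- {line}")
```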

&lt;h3&gt;
  
  
  Mistake 9. Not using Git from day one (the mistake I'm glad I didn't make)
&lt;/h3&gt;

&lt;p&gt;I want to be honest here: this one isn't actually my mistake. I started using Git from the very beginning on every agent project I've ever built, and that single habit has saved me more times than I can count. I'm including it because the number of beginners I've watched skip it and then lose weeks of work is too high to leave out.&lt;/p&gt;

&lt;p&gt;Git is the thing that lets you roll back to a working version when something goes wrong. And something will go wrong. Your agent will make a change to a file you didn't expect. You'll delete the wrong folder. You'll let the model rewrite something that was working and discover two days later that the new version is worse. Without Git, you're stuck trying to remember what the file looked like three days ago. With Git, you type one command and you're back.&lt;/p&gt;

&lt;p&gt;The good news is this is now genuinely easy, even for non-technical people. You can ask your harness to set up a Git repository for you and it'll do the whole thing. Private repo on GitHub is free and fine. You can even set up an automation so that every time your agent finishes a meaningful task, it commits and pushes the current state to the repo automatically, which means you basically never lose work. I set mine up like that and I haven't thought about it since.&lt;/p&gt;
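&lt;p&gt;The commit-after-every-task habit is small enough to sketch directly. This is my own minimal version, not a built-in feature of any harness: stage everything, commit only if something actually changed, push.&lt;/p&gt;

```python
import subprocess

def checkpoint(repo_dir, task_name):
    """Commit and push the current state after a finished task."""
    def git(*args):
        return subprocess.run(["git", "-C", repo_dir, *args],
                              capture_output=True, text=True)
    git("add", "-A")
    # 'git diff --cached --quiet' exits nonzero only when something is staged.
    if git("diff", "--cached", "--quiet").returncode != 0:
        git("commit", "-m", f"checkpoint: {task_name}")
        git("push")  # fails harmlessly if no remote is configured yet
```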

&lt;p&gt;If you remember nothing else from this section, remember this: &lt;strong&gt;commit and push every working version of your agent, from the very first day.&lt;/strong&gt; It's the cheapest insurance policy in the whole setup, and every single person who has ever lost work to a runaway edit wishes they'd done it sooner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus mistake. Thinking you need to build alone
&lt;/h3&gt;

&lt;p&gt;I'll say this honestly because I lived it: building an agent in isolation is much slower than building one while reading what other people are running into. Communities, newsletters, GitHub discussions, random Substack notes at midnight. The people doing this work are almost all willing to share what they're learning. Go find them. I learned some of the most important things I know from comments on my own posts, which is the only reason this post exists at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Context window is the whole game
&lt;/h2&gt;

&lt;p&gt;hohoda in the comments on my original note nailed something I think about constantly:&lt;/p&gt;

&lt;p&gt;The context window is the real constraint. Everything else (tools, models, memory) is downstream of how well you manage what the agent sees at any given moment.&lt;/p&gt;

&lt;p&gt;200,000 tokens sounds like a lot. It isn't, once you understand what fills it.&lt;/p&gt;

&lt;p&gt;Every session auto-loads a bunch of stuff before you've even typed anything: your core instructions file, your memory files, the conversation history if there is any, the current task state. That's your "always-on" overhead. For me, that adds up fast. It's a cost I didn't fully understand at first, because it happens before you see a single response.&lt;/p&gt;
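&lt;p&gt;You can get a feel for that overhead with a rough estimate. This sketch assumes my own layout (a CLAUDE.md plus a memory/ folder) and the crude rule of thumb of roughly four characters per token; it's not an official count from any harness.&lt;/p&gt;

```python
from pathlib import Path

def startup_overhead(workspace):
    """Rough token estimate (about 4 characters per token) for
    everything a session auto-loads before you type anything."""
    ws = Path(workspace)
    always_loaded = [ws / "CLAUDE.md", *sorted((ws / "memory").glob("*.md"))]
    return sum(len(p.read_text()) // 4 for p in always_loaded if p.exists())
```

&lt;p&gt;By that math, a 2,000-character CLAUDE.md alone costs roughly 500 tokens off the top of every single session, before the conversation even starts.&lt;/p&gt;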

&lt;p&gt;For a beginner, three rules carry you a long way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep your CLAUDE.md thin.&lt;/strong&gt; Every line you add is a line the model has to read at the start of every single session. Treat it like precious real estate. If you can say it shorter, say it shorter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One memory file per project, and that's it.&lt;/strong&gt; Don't build a vector database. Don't install a semantic search engine. Don't set up a temporal knowledge graph. Not on day one. A flat markdown file per project is enough for a surprisingly long time. That's how I started and it worked for months.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't worry about compaction yet.&lt;/strong&gt; Eventually, once your memory files get large, you might want a process that rewrites them to stay under a size threshold. I run one every night now. That's a month-three problem, not a day-one problem.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
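&lt;p&gt;When you do reach the compaction stage, the naive version is tiny. This sketch just keeps the newest half of a memory file once it crosses a size budget; my real nightly pass asks the model to summarize instead of truncating, but the trigger logic is the same idea.&lt;/p&gt;

```python
def compact_memory(text, max_chars=8000, keep_recent=0.5):
    """Naive compaction sketch: once a memory file outgrows its
    budget, keep only the newest lines behind a marker."""
    if len(text) > max_chars:
        lines = text.splitlines()
        cut = int(len(lines) * (1 - keep_recent))
        return "\n".join(["[older entries compacted]", *lines[cut:]])
    return text
```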

&lt;p&gt;For almost any beginner project, 200k tokens is more than enough. A back-and-forth conversation over iMessage barely touches the budget. The failure mode is almost never "model context too small." It's "my CLAUDE.md bloated to 800 lines and now every session starts with a giant anchor around its neck."&lt;/p&gt;

&lt;p&gt;I wrote a longer piece on &lt;a href="https://thoughts.jock.pl/p/how-i-structure-claude-md-after-1000-sessions" rel="noopener noreferrer"&gt;how I keep my own CLAUDE.md structured after a thousand plus sessions&lt;/a&gt; if you want to see the mature version. For now, just remember: thin instructions, one memory file per project, and context is the first thing that'll bite you when the agent starts behaving strangely.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Security from day one
&lt;/h2&gt;

&lt;p&gt;Bianca Schulz asked about security on my note, and this is the section I think about the most when I write pieces like this. It was one of the biggest reasons I built my own agent instead of using an off-the-shelf one.&lt;/p&gt;

&lt;p&gt;Here's the thing: an AI agent is a new attack surface on your computer. It has permissions. It runs code. It reads your files. It talks to the internet. And because we're still early in how this all works, the models that drive it can be tricked, manipulated, or prompt-injected in ways we don't fully understand yet. You're adding a new thing with a lot of power to your machine, and you should act like that.&lt;/p&gt;

&lt;p&gt;My progression was deliberate, and I'd recommend something similar for you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MacBook phase.&lt;/strong&gt; Very restricted permissions. Only the folders I explicitly whitelisted. No blanket network access. No access to real credentials. I built slowly and paid attention to what the agent actually needed. My personal machine has my personal things on it, and I wasn't about to let a half-built agent loose in there.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Learning phase.&lt;/strong&gt; As I understood what the agent actually needed and could trust it with, I expanded its permissions carefully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dedicated machine phase.&lt;/strong&gt; Eventually I moved it to its own Mac Mini. An isolated computer, dedicated to the agent, with its own accounts and its own credentials. That machine is where the agent has broad permissions. My personal laptop doesn't, and never will again.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
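&lt;p&gt;The whitelisting idea from the MacBook phase can be expressed as a deny-by-default path check. The folder paths below are made-up examples; the point is that anything outside the list, including ../ traversal tricks, gets refused.&lt;/p&gt;

```python
from pathlib import Path

# Example whitelist; the paths are invented, the deny-by-default stance is the point.
ALLOWED = [Path("/Users/me/agent/projects"), Path("/Users/me/agent/memory")]

def is_allowed(path, allowed=ALLOWED):
    """Resolve the path first so '../' tricks can't escape the whitelist."""
    target = Path(path).resolve()
    return any(target.is_relative_to(root) for root in allowed)
```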

&lt;p&gt;A rule I learned the hard way and will give you for free: &lt;strong&gt;the agent should have its own accounts, not yours.&lt;/strong&gt; Its own email address. Its own API keys. Its own logins. Don't share your personal credentials with it. When something goes wrong, and something will eventually go wrong, you want the blast radius to be contained.&lt;/p&gt;

&lt;p&gt;Two months ago I launched a small tool called &lt;a href="https://thoughts.jock.pl/p/claude-skill-auditor-security-scanner-claude-code-2026" rel="noopener noreferrer"&gt;a security scanner for Claude Code skills&lt;/a&gt;, which hit the front page of Hacker News. I built it because I was reading stories about autonomous agents being exploited in the wild and realized I wanted a way to check my own setup against a list of known issues. If you're running anything serious, something like this is worth having in your toolbox. And even if you're not, just paying attention to permissions from day one will put you ahead of almost everyone else building in this space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing. Start small, start today
&lt;/h2&gt;

&lt;p&gt;You don't need the strongest model. You don't need a fancy framework. You don't need a PhD in machine learning or expensive hardware or a cloud account.&lt;/p&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A laptop you already own.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A $20 per month subscription to a real model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A harness. Any harness. Pick one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A folder on your computer, with CLAUDE.md, a memory/ subfolder, a projects/ subfolder, and a secrets/ subfolder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One real project you actually want to exist. Not a demo. Something you'd use tomorrow morning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
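&lt;p&gt;That folder layout is a few lines to create, if you'd rather script it than click around. The starter line in CLAUDE.md is just a placeholder; write your own.&lt;/p&gt;

```python
from pathlib import Path

def scaffold(root="agent-workspace"):
    """Create the starter layout from the list above."""
    base = Path(root)
    for sub in ("memory", "projects", "secrets"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    claude = base / "CLAUDE.md"
    if not claude.exists():
        claude.write_text("# Who this agent is and what it is allowed to do\n")
    return base
```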

&lt;p&gt;Start with that. The rest (all the architecture and the self-improvement and the model routing and the memory compaction) comes as you grow into it. None of it needs to exist on day one.&lt;/p&gt;

&lt;p&gt;Everything will break regularly. Your harness will update under your feet. Your instructions file will drift. Your context window will bloat. The model will hallucinate a function that doesn't exist and confidently insist it does. Although it cost me a lot of time at the beginning, I really don't mind it anymore. It's the job right now, and I accept that. &lt;a href="https://thoughts.jock.pl/p/my-ai-agent-works-night-shifts-builds" rel="noopener noreferrer"&gt;I wrote my first piece about Wiz back when it was just a night-shift experiment&lt;/a&gt;, and looking back, almost everything I thought I knew then was wrong. That's fine. The only thing that compounds is the habit of building, breaking things, fixing them, and writing down what you learned.&lt;/p&gt;

&lt;p&gt;The people in my comments who asked for this post already know more than most. Almost all of you have the instinct, and most of you have the tools. What's left is the part I can't do for you: opening the folder, writing the first line of CLAUDE.md, and running something small tonight that didn't exist this morning.&lt;/p&gt;

&lt;p&gt;Go build your first agent. Then tell me what broke.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write about building Wiz, my AI agent, roughly twice a week on Digital Thoughts. Every mistake, every rebuild, every thing that surprised me along the way. If this post was useful, subscribe and you'll get the next one as soon as it goes out.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://thoughts.jock.pl/p/how-to-build-your-first-ai-agent-beginners-guide-2026" rel="noopener noreferrer"&gt;Digital Thoughts on Substack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>agents</category>
      <category>automation</category>
    </item>
    <item>
      <title>Claude Code vs Codex CLI vs Aider vs OpenCode vs Pi vs Cursor: Which AI Coding Harness Actually Works Without You?</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Thu, 16 Apr 2026 11:09:35 +0000</pubDate>
      <link>https://dev.to/joozio/claude-code-vs-codex-cli-vs-aider-vs-opencode-vs-pi-vs-cursor-which-ai-coding-harness-actually-79l</link>
      <guid>https://dev.to/joozio/claude-code-vs-codex-cli-vs-aider-vs-opencode-vs-pi-vs-cursor-which-ai-coding-harness-actually-79l</guid>
      <description>&lt;h1&gt;
  
  
  Claude Code vs Codex CLI vs Aider vs OpenCode vs Pi vs Cursor: Which AI Coding Harness Actually Works Without You?
&lt;/h1&gt;

&lt;p&gt;My AI agent &lt;a href="https://thoughts.jock.pl/p/building-ai-agent-night-shifts-ep1" rel="noopener noreferrer"&gt;wakes up at 2am, picks tasks from a queue, ships code, and sends me a report by morning&lt;/a&gt;. For that to work, I need a coding harness I can trust when I'm not watching.&lt;/p&gt;

&lt;p&gt;Not a tool that helps me code faster. A tool that codes when I'm asleep.&lt;/p&gt;

&lt;p&gt;That's a different question than "which IDE is best." IDEs are for humans who are present. Harnesses are for when you're not. It's also not the same question as "which has the best autocomplete." That's a different category entirely, one we're not touching here.&lt;/p&gt;

&lt;p&gt;I've used Claude Code daily for months, run Codex CLI and OpenCode in parallel, tested Pi, and dug into the open-source alternatives. This is what I actually think.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Harness Actually Is
&lt;/h2&gt;

&lt;p&gt;A harness connects the horse to the cart. In AI coding, it's the set of tools and environment in which the agent operates.&lt;/p&gt;

&lt;p&gt;Here's the thing most people miss: LLMs can only generate text. That's it. They can't read your files, run commands, or edit code directly. What a harness does is give the model structured tool calls it can emit as text. The harness intercepts those, executes them with real code, appends the output to the conversation history, and prompts the model to continue. Every tool call follows the same loop: model pauses, harness runs something, result added to context, model restarts. At its core this is about 60-75 lines of Python. The complexity is entirely in the tuning: what tools the model gets, how those tools are described, and what the system prompt says.&lt;/p&gt;
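&lt;p&gt;To make that loop concrete, here's a stripped-down sketch. The "TOOL: name arg" convention and the scripted fake model are mine, not any real harness's wire format, but the shape of the loop is exactly the one described above.&lt;/p&gt;

```python
def run_agent(task, model, tools, max_steps=10):
    """Minimal harness loop: the model only ever emits text; the
    harness executes tool calls and feeds the output back in."""
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        reply = model("\n".join(history))       # model emits text, nothing more
        history.append(reply)
        if reply.startswith("TOOL:"):
            name, _, arg = reply[5:].strip().partition(" ")
            result = tools.get(name, lambda a: "unknown tool")(arg)
            history.append(f"RESULT: {result}") # output goes back into context
        else:
            return reply                        # no tool call means we're done
    return history[-1]

# A scripted fake model that asks for one tool, then finishes:
tools = {"read_file": lambda path: "contents of " + path}
script = iter(["TOOL: read_file notes.txt", "DONE: summarized notes.txt"])
print(run_agent("summarize notes", lambda context: next(script), tools))
```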

&lt;p&gt;This matters because the tuning is where harnesses actually diverge. Two harnesses running the same model on the same task can produce dramatically different results. Not because of the model, but because of what the harness tells the model it can do and how to use it.&lt;/p&gt;

&lt;p&gt;Tab autocomplete isn't a harness. It's a suggestion box. A nice UI on top of an existing harness (like T3 Code, which wraps Claude Code and Codex CLI) is also not a harness. The real question for every tool below: can it take a task, execute it end-to-end across multiple files, handle errors, and report back without me in the loop?&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Different Categories: Coding Tools vs Agent Orchestrators
&lt;/h2&gt;

&lt;p&gt;Before comparing specific tools, it's worth naming the split that most comparisons ignore. Not all "AI coding harnesses" are trying to do the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coding tools&lt;/strong&gt; are pair programmers. You direct each step. They execute that step very well, commit the result, and wait for the next instruction. Aider is the clearest example; Codex CLI leans this way too, and so does Cline. These are tools built around the assumption that you're at the keyboard and providing direction. They make individual tasks faster and better. They're not designed to chain 40 decisions together autonomously while you sleep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent orchestrators&lt;/strong&gt; are designed to take a goal and execute autonomously across multiple steps, files, and decision points. Claude Code is built for this. Devin is the extreme version. Pi, if you build out the harness fully, fits here. These tools are designed around the assumption that you're not watching, and they need to make judgment calls without asking.&lt;/p&gt;

&lt;p&gt;Most comparisons treat all of these as the same thing and rank them on the same axis. That produces misleading results. Aider isn't trying to replace Claude Code for overnight autonomous runs. Codex CLI isn't trying to be an agent orchestrator in the same sense Claude Code is. Judging them by the same criteria produces noise.&lt;/p&gt;

&lt;p&gt;The honest answer to "which is best" depends entirely on which category you need. This post tries to be clear about which tools belong where, and let you make the call for your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark Reality (And Why It Doesn't Tell the Full Story)
&lt;/h2&gt;

&lt;p&gt;SWE-bench Verified became the standard benchmark for this category. It measures how often a coding agent independently resolves real GitHub issues from start to finish. That status also made it a target. Researchers flagged contamination: training data for newer models overlaps with the test set, which inflates scores. The cleaner alternative is &lt;strong&gt;SWE-bench Pro&lt;/strong&gt;, introduced in 2026, with 2,000+ problems that weren't in any public training data. GPT-5.4-Codex leads there at 56.8%. Harder problems, more honest scores.&lt;/p&gt;

&lt;p&gt;Terminal-Bench 2.0 deserves a separate mention because it's more relevant for agentic tasks than SWE-bench. It tests autonomous, multi-step execution in real terminal environments. Not just code edits. Actual shell navigation, file management, running commands in sequence, recovering from errors. The Claude Code harness configuration benchmarked here ("Claude Mythos") hits 92.1%. Codex CLI hits 77.3%. That 15-point gap is a better signal for overnight autonomous work than SWE-bench numbers.&lt;/p&gt;

&lt;p&gt;Now the result that breaks the "pick the highest number" logic. Matt Mayer ran an independent test comparing the same model inside different harnesses. Claude Opus: 77% in Claude Code, 93% in Cursor. Same model. Same tasks. 16 percentage points from the harness alone. That's not an outlier. CORE-Bench found Claude Opus at 42% with a minimal scaffold, rising to 78% inside Claude Code's full harness. Across multiple independent studies the harness effect ranges from 5 to 40 percentage points depending on model and task type.&lt;/p&gt;

&lt;p&gt;A few flags before reading the tool sections. Cursor doesn't publish SWE-bench Verified results and uses its own proprietary CursorBench at 61.3% instead. Draw your own conclusions. OpenCode and Pi have no published scores because their performance is entirely model-dependent. Devin's frequently cited 13.86% figure is from 2023 and belongs in a museum. It does not appear in the current top 30 of any major leaderboard.&lt;/p&gt;

&lt;p&gt;What the scores actually tell you: harness quality matters as much as the model you put in it. Cursor employs people whose full-time job is to rewrite system prompts and tool descriptions every time a new model ships. Claude will keep using a tool you label "deprecated." Gemini will abandon structured tools entirely and only use bash. Cursor tests obsessively and adjusts. Most harnesses don't. Keep this in mind across every section below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code: The Deep Harness
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: Agent orchestrator | &lt;a href="https://code.claude.com" rel="noopener noreferrer"&gt;code.claude.com&lt;/a&gt; | &lt;a href="https://github.com/anthropics/claude-code" rel="noopener noreferrer"&gt;GitHub (114k stars)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Full disclosure: this is what I use daily, and what runs &lt;a href="https://thoughts.jock.pl/p/ai-agent-self-extending-self-fixing-wiz-rebuild-technical-deep-dive-2026" rel="noopener noreferrer"&gt;Wiz&lt;/a&gt; on a headless Mac Mini overnight. I try to be honest about it.&lt;/p&gt;

&lt;p&gt;Claude Code is the most complete agentic runtime available right now. It reads CLAUDE.md, a project-specific instruction file that persists across every session. You can describe your entire architecture, your preferences, your forbidden patterns, and the agent carries that into every run without you repeating it. It has Agent Teams for spinning up parallel sub-agents that coordinate on a shared goal. As of March 2026, computer use means it can point and click through UIs, take screenshots, and handle workflows that resist scripting.&lt;/p&gt;

&lt;p&gt;The thing &lt;a href="https://thoughts.jock.pl/p/the-compounding-agent-ep4" rel="noopener noreferrer"&gt;I keep noticing with Claude Code&lt;/a&gt; is that it genuinely builds on context over time. A session that starts with "add authentication" will remember the decisions it made about your auth architecture when it gets to "add rate limiting" three steps later. That coherence across a long task chain is what makes it feel like an agent rather than a very fast typist.&lt;/p&gt;

&lt;p&gt;One important thing about how any harness uses context: the model only knows what's in its conversation history. When Claude Code opens your project, it doesn't already know your codebase. It explores via tool calls, building context incrementally. CLAUDE.md front-loads that context so fewer tool calls are wasted on discovery. Dumping your entire codebase into context (the old Repomix approach) is the wrong answer. Past around 50-100k tokens, model accuracy drops significantly. More context makes models dumber past a threshold. Good harnesses build context as needed, not all at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it struggles:&lt;/strong&gt; context loss on sessions longer than 2 hours, where it starts forgetting early decisions. Terminal-only interface has a real learning curve. Token consumption is 3-4x higher than Codex CLI per equivalent task, which compounds on long autonomous sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; complex multi-file tasks, overnight autonomous runs, architecture-level changes that require consistent context across many steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Claude Pro ($20/mo) or Max ($100+/mo). For regular autonomous sessions, Max is almost certainly necessary. The per-token costs on long runs add up fast. For a detailed Claude Code vs Codex head-to-head from two months of real usage, &lt;a href="https://thoughts.jock.pl/p/claude-code-vs-codex-real-comparison-2026" rel="noopener noreferrer"&gt;I covered that comparison separately&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Codex CLI: Good, But Not What the Hype Says
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: Coding tool, emerging agent | &lt;a href="https://openai.com/codex/" rel="noopener noreferrer"&gt;openai.com/codex&lt;/a&gt; | &lt;a href="https://github.com/openai/codex" rel="noopener noreferrer"&gt;GitHub (67k stars)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Codex CLI is not the old Codex model from 2021. It's OpenAI's terminal-based agent, open-source on GitHub, bundled with ChatGPT Plus or Pro, running on GPT-5.4. The benchmark puts it at 77.3% on SWE-bench, close to Claude Code's 80.8%, and at 3-4x lower token cost. On paper, a strong contender.&lt;/p&gt;

&lt;p&gt;In practice, my honest read: it's cold. That's the right word. What I mean is that Codex CLI feels raw as an agent. It executes individual steps cleanly, but it doesn't feel like it's building toward something the way Claude Code does. Give it a multi-step task: add this feature, connect it to this other component, update the tests. It handles step one well, sometimes step two, and starts losing coherence by step three or four. It restates what it did, asks for clarification it shouldn't need, or misses a dependency it should have caught from context it already has. That gap between 77.3% and 80.8% is exactly this: Claude Code holds context through longer chains.&lt;/p&gt;

&lt;p&gt;Where Codex CLI genuinely shines is raw coding quality on focused tasks. iOS apps, macOS apps, web apps. Give it a specific, contained task and GPT-5.4 is excellent. The code quality on front-end work, app scaffolding, and UI logic is strong. I'd put it on par with or ahead of Claude Sonnet for this category of work. It's not the harness that's the advantage there. It's GPT-5.4 being particularly strong at app development.&lt;/p&gt;

&lt;p&gt;The architectural difference worth knowing: Codex CLI runs in cloud containers managed by OpenAI, not on your local machine. You can fire off a task and disconnect. The task keeps running without your terminal staying open. For batch work and overnight jobs where you're not monitoring, that's genuinely useful. For tight local loops where your environment variables and local state matter, you're working around the sandboxing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it struggles:&lt;/strong&gt; multi-step agentic chains with dependencies. Feels unfinished as a full harness compared to Claude Code. Less context coherence on complex tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; focused coding tasks (especially apps), token-efficient runs, developers already on ChatGPT Plus who want to try a CLI agent without extra cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; included with ChatGPT Plus ($20/mo) or Pro ($200/mo). If you're already paying for ChatGPT, this is essentially free to try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aider: The Underrated Open-Source Standard
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: Coding tool (pair programmer) | &lt;a href="https://aider.chat" rel="noopener noreferrer"&gt;aider.chat&lt;/a&gt; | &lt;a href="https://github.com/Aider-AI/aider" rel="noopener noreferrer"&gt;GitHub (43k stars)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Aider is the tool most people in the "AI coding" conversation have never used, even though it has 43,000 GitHub stars and processes 15 billion tokens per week in production. That's not a toy project.&lt;/p&gt;

&lt;p&gt;The model is fundamentally different from Claude Code or Codex. Aider is a git-first pair programmer, not an autonomous orchestrator. You bring your own model (Claude Sonnet, GPT-5, Gemini 2.5, DeepSeek, Qwen, local Ollama) and Aider wraps it with git-native execution. Every AI edit becomes a commit. The repo map gives it structural understanding of your whole codebase before it touches anything. It auto-lints and runs tests after every change, self-fixing detected issues before reporting back.&lt;/p&gt;

&lt;p&gt;The token efficiency is striking: 4.2x fewer tokens than Claude Code per equivalent task. If you're paying for API access directly, Aider with Claude Sonnet is the most cost-efficient path to serious coding automation by a wide margin.&lt;/p&gt;

&lt;p&gt;The honest tradeoff: Aider doesn't orchestrate across 40 files and coordinate sub-agents. It executes a task, executes it well, and commits the result. It's more like having a disciplined pair programmer who never skips a commit than a system that independently plans and executes a multi-hour architecture session. For incremental work, refactoring a module, implementing a feature, fixing a class of bugs, it's the right tool. For overnight autonomous sessions that need to make judgment calls across large contexts: Claude Code.&lt;/p&gt;

&lt;p&gt;The git-first philosophy deserves separate mention. Every change is committed. Your entire interaction with the agent is auditable, reversible, and reviewable inside your normal git workflow. No other tool in this list bakes that in at the same level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; focused incremental work, budget setups, teams that want full audit trails, developers who want BYOM flexibility without giving up discipline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; free. You pay your model provider directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenCode: The Provider Switcher
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: Hybrid (coding + emerging agent) | &lt;a href="https://opencode.ai" rel="noopener noreferrer"&gt;opencode.ai&lt;/a&gt; | &lt;a href="https://github.com/opencode-ai/opencode" rel="noopener noreferrer"&gt;GitHub (72k stars)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OpenCode's value proposition is breadth: 75+ LLM providers, all accessible from the same interface. Anthropic, OpenAI, Google, DeepSeek, AWS Bedrock, Azure, local Ollama, and more. I've used it with Claude Opus, GPT models, and open-weight models like Qwen and GLM. The switching experience is genuinely seamless in a way that nothing else matches. One command, different provider, same workflow. You can't do that in Claude Code or Codex.&lt;/p&gt;

&lt;p&gt;But I'll be honest about something: there's something missing from the experience. It's hard to name exactly. After using it alongside Claude Code for a while, I notice OpenCode doesn't feel like it's building a working relationship with your project. There's no CLAUDE.md equivalent that persists project context. There's no Agent Teams layer for coordinating parallel work. The autonomous behavior is functional but less mature. It handles individual tasks well, but it doesn't feel like a system designed for extended unattended operation.&lt;/p&gt;

&lt;p&gt;With open-weight models like Qwen and GLM, it's fine. Gets the job done for straightforward tasks. You're not going to get Claude Opus-level reasoning, but for routine edits and quick fixes, the cost savings are real.&lt;/p&gt;

&lt;p&gt;The provider switching is genuinely the killer feature. If you're doing model experiments, comparing how GPT-5.4 handles a task vs Claude Sonnet vs a local Qwen, OpenCode is the tool for that. If you already have subscriptions to multiple providers and want to use them without managing separate CLI tools, OpenCode is the right architecture. But for a long-term primary agent setup where you need consistent, deep project context: I'd reach for something else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; model experimentation, teams with multiple provider subscriptions, privacy-first setups with local Ollama, cost arbitrage across providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; free. BYOM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pi: The One I Actually Want to Use More
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: Coding tool + primitives harness | &lt;a href="https://pi.dev" rel="noopener noreferrer"&gt;pi.dev&lt;/a&gt; | &lt;a href="https://github.com/badlogic/pi-mono" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Pi is genuinely different from everything else here, and I want to say this upfront: I like it. It's fast, it's flexible, and the experience is clean in a way proprietary tools often aren't. If I could choose without constraints, Pi is probably the closest thing to what I'd want as a daily harness alternative to Claude Code.&lt;/p&gt;

&lt;p&gt;The design philosophy is the opposite of the "more features" trend. Its tagline is blunt: "there are many coding agents, but this one is mine." Instead of an opinionated harness, it gives you primitives. A minimal core you configure yourself. Terminal TUI, 15+ LLM providers, tree-structured session history you can navigate and export, and four operation modes. The interesting one for builders: RPC mode. Pi runs as an embeddable subprocess inside a larger automation system. Your orchestration layer calls Pi, it executes the coding task, returns structured output. Designed to be a component in a system, not a standalone tool.&lt;/p&gt;

&lt;p&gt;What's deliberately absent: sub-agents, plan mode, permission popups, background processes. Pi's bet is that most harnesses embed too many assumptions about your workflow. Strip to primitives, ship extensions via npm, build exactly what you need. AGENTS.md and SYSTEM.md play the same role CLAUDE.md does in Claude Code.&lt;/p&gt;

&lt;p&gt;So why am I not using it more? One reason, and it's a real one: &lt;strong&gt;Anthropic's billing doesn't let you bring your Max subscription to third-party harnesses.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pi is BYOM, bring your own API key. When I tested it with Claude, Pi surfaced a message explicitly: usage through Pi counts against API billing, not your Claude subscription. So if you're on Claude Max ($100+/mo), using Pi with Claude means paying twice. The Max subscription for Claude Code, and API rates on top for Pi. Those costs add up fast on any serious coding session. I was paying from my own pocket to test something I wanted to use more. That's not a good feeling.&lt;/p&gt;

&lt;p&gt;This isn't Pi's fault. It's Anthropic's policy. They don't allow third-party harnesses to draw on subscription credits. You have to use Claude Code to get what you're paying for on the subscription. Google does the same with Gemini. Theo from T3 made this point in a recent video on harnesses: if you're paying $200/month for Opus, you have to use their harness. OpenAI, by contrast, lets your API credits work across third-party tools freely.&lt;/p&gt;

&lt;p&gt;In a world where Anthropic changed this, where your Max subscription applied to any MCP-compatible harness, Pi would probably be what I'd reach for first. The speed, the flexibility, the primitives-first design: it fits the kind of automation system I'm building. But until that policy changes, the economics don't work for anyone on a Claude subscription. You pay for Claude twice if you want to experiment with a different harness.&lt;/p&gt;

&lt;p&gt;If you're on GPT or open-weight models (Qwen, DeepSeek, GLM), Pi has none of these constraints. The billing goes through OpenAI or your provider directly. For a Claude-first setup: this is the wall you'll hit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; GPT or open-weight model setups, building custom harness architectures, embedding a coding agent as a subprocess in larger systems, developers who want full control with no opinions baked in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not ideal for:&lt;/strong&gt; Claude-first developers on Max. You'll pay API rates on top of your subscription.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; free, MIT license. BYOM. Factor in API costs if using Anthropic models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cursor: The Best Supervised Experience, Not Yet a Harness
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Category: IDE with supervised agent mode | &lt;a href="https://cursor.com" rel="noopener noreferrer"&gt;cursor.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Cursor is an IDE first. Its agent mode deserves inclusion in this conversation because of how fast the direction is changing, not because it's a harness today.&lt;/p&gt;

&lt;p&gt;Cursor 3 (released April 2026) added cloud agents on isolated VMs, /worktree for isolated branch changes, self-hosted agents, and parallel Agent Tabs. Thirty percent of Cursor's own internal PRs are now agent-made. The supervised IDE experience, with Design Mode (annotate a mockup, get an implementation), parallel agents, and deep JetBrains support, is the best developer experience available at the keyboard right now.&lt;/p&gt;

&lt;p&gt;As an overnight harness: not there. When left without supervision, it stalls at the first ambiguous decision point. That's not a bug. It's a design choice. Cursor is built for developers who are present and want an agent that won't make unilateral decisions on their codebase. That's the right call for most developers, but it also means Cursor isn't the tool for autonomous runs.&lt;/p&gt;

&lt;p&gt;The 77% to 93% Opus benchmark is the thing worth studying. Cursor extracts more from the same model through obsessive harness tuning, by people whose whole job is rewriting system prompts and tool descriptions for each new model release. The gap is real and compounds across tasks. The cloud agents direction makes me think this section of the comparison will look very different in 12 months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; daily supervised coding, developers who want the best IDE-plus-agent experience at the keyboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Hobby (free), Pro ($20/mo), Ultra ($200/mo), Teams ($40/user/mo).&lt;/p&gt;

&lt;h2&gt;
  
  
  A Few More Worth Knowing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://goose-docs.ai" rel="noopener noreferrer"&gt;Goose&lt;/a&gt; (Block/Square, &lt;a href="https://github.com/block/goose" rel="noopener noreferrer"&gt;GitHub, 41k stars&lt;/a&gt;):&lt;/strong&gt; Open-source, MCP-based, general-purpose agent. Not coding-specific, but handles code tasks well. Right fit if you want automation that goes beyond coding into broader workflows. Apache 2.0 license.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cline.bot" rel="noopener noreferrer"&gt;Cline&lt;/a&gt; (&lt;a href="https://github.com/cline/cline" rel="noopener noreferrer"&gt;GitHub, 60k stars&lt;/a&gt;):&lt;/strong&gt; Open-source, supports VS Code, JetBrains, Neovim, Emacs. Widest multi-IDE coverage of any tool in this list. Good MCP support. Worth looking at if your stack spans multiple editors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://geminicli.com" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt; (Google, &lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;GitHub, 96k stars&lt;/a&gt;):&lt;/strong&gt; Free with a Google account. 60 requests/minute, 1,000/day, 1 million token context window. Genuinely generous free tier. Strong on frontend tasks. The right starting point if budget is the hard constraint and you don't have API credits elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://devin.ai" rel="noopener noreferrer"&gt;Devin&lt;/a&gt; (Cognition):&lt;/strong&gt; Full autonomy, cloud sandbox, Linux shell, browser. Significantly more accessible than before: Core tier at $20/mo plus $2.25 per ACU (autonomous compute unit). Resolves 13.86% of real GitHub issues end-to-end, a dramatic improvement over what was possible two years ago. Worth evaluating for teams with consistent engineering backlogs, not just enterprise anymore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/pingdotgg/t3code" rel="noopener noreferrer"&gt;T3 Code&lt;/a&gt; (Theo):&lt;/strong&gt; Not a harness. A UI wrapper on top of Claude Code and Codex CLI. Useful to name because it comes up in these conversations. If you don't have Claude Code installed, T3 Code won't do Claude tasks. The UI is the product, not the agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same Task, Different Harness
&lt;/h2&gt;

&lt;p&gt;The fairest way to compare these is to run the same type of task and watch what happens. Here's the pattern I kept seeing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex multi-step agent task (e.g. "add this feature, connect it to the auth system, update the affected tests, write a changelog entry"):&lt;/strong&gt; Claude Code holds the chain. It remembers what it did in step one when it reaches step four. Codex CLI starts strong but frays around step three. OpenCode and Aider handle each step well in isolation, but need more direction between steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focused app development (iOS, macOS, web UI):&lt;/strong&gt; Codex CLI with GPT-5.4 is competitive here. The code quality on app work is strong, sometimes ahead of Claude Sonnet. Claude Code with Opus is still better on complex multi-component app logic, but for a contained feature or a new screen: Codex CLI is a legitimate choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget-constrained incremental refactoring:&lt;/strong&gt; Aider with Claude Sonnet or DeepSeek is the clear call. The 4.2x token efficiency advantage is real. The git-first commit-per-change model gives you a clean audit trail. You pay for what you actually use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I want to run the same task with three different models and compare":&lt;/strong&gt; OpenCode. Nothing else makes provider switching this frictionless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overnight autonomous work where you're not monitoring:&lt;/strong&gt; Claude Code. The infrastructure is designed for exactly this. CLAUDE.md project context, background scheduling, Agent Teams, error handling. Everything else is built around having a human present.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which One Fits Your Workflow?
&lt;/h2&gt;

&lt;p&gt;There's no universally "best" harness. The honest answer depends on a few questions about how you actually work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you at the keyboard or not?&lt;/strong&gt; If you're supervising every step, Cursor gives you the best IDE experience and the most model-agnostic setup. If you want autonomous execution with no supervision, Claude Code is the only tool built end-to-end for that. Everything else sits somewhere in between.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you need to chain many steps or execute one step well?&lt;/strong&gt; Multi-step autonomous chains with dependencies: Claude Code. Focused, contained tasks with excellent code quality: Aider or Codex CLI. There's a real difference between a pair programmer and an orchestrator, and the right choice depends on which problem you're actually solving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your budget?&lt;/strong&gt; If you're price-sensitive, Aider with a cheap backend (DeepSeek, Qwen, even Gemini) is the clearest path to real coding automation at minimal cost. Gemini CLI is free with generous limits. OpenCode lets you use whatever provider is cheapest for the task at hand. None of these require a $100/mo subscription.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you care about model flexibility?&lt;/strong&gt; If you want to switch between Claude, GPT, open-weight models, and local Ollama without friction, OpenCode or Aider are the right architectures. Claude Code and Codex CLI are provider-locked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you building a system or using a tool?&lt;/strong&gt; If you're assembling a larger automation where a coding agent is one component among many, Pi's RPC mode and primitives-first design are worth the setup investment. If you just want to get code written, start with Claude Code or Aider depending on your budget and task type.&lt;/p&gt;

&lt;p&gt;The mistake most people make is picking a tool based on a benchmark and then wondering why it doesn't feel right in their actual workflow. The benchmark measures what the model can do on a standardized task. Your workflow isn't a standardized task.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Verdict
&lt;/h2&gt;

&lt;p&gt;After months of real use, here's where I land.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code for autonomous execution.&lt;/strong&gt; Not because it's perfect. Context loss on sessions over 2 hours is a genuine problem, and the token cost is genuinely high. But it's the only tool built, end to end, for the question "can I leave this running while I sleep?" Agent Teams, background scheduling, CLAUDE.md project memory, computer use. The infrastructure reflects that goal. &lt;a href="https://thoughts.jock.pl/p/mac-mini-ai-agent-migration-headless-2026" rel="noopener noreferrer"&gt;My headless Mac Mini setup&lt;/a&gt; runs on this for exactly this reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex CLI for app work.&lt;/strong&gt; GPT-5.4 is genuinely excellent at iOS, macOS, and web app development. For a contained feature with a clear spec, it's fast, cheap, and produces clean code. The harness feels raw for complex agentic chains, but for the coding task itself, it earns its place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aider for budget, discipline, and BYOM.&lt;/strong&gt; The 4.2x token efficiency is real. The git-first model is actually better discipline than what you get from proprietary tools. If you want to run open-weight models like Qwen or DeepSeek and maintain a clean git history, Aider is the right architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenCode for model exploration.&lt;/strong&gt; If you're actively experimenting with providers or you have multiple subscriptions you want to use from a single interface, nothing else compares on the switching experience. But don't expect it to replace Claude Code for sustained agent work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pi for builders (with an asterisk).&lt;/strong&gt; If you're constructing a system where a coding agent is one component among many, the RPC mode and primitives-first design are genuinely the right architecture. It's fast, it's flexible, and if I had no constraints I'd use it far more. The asterisk: Anthropic currently doesn't allow third-party harnesses to draw on Max subscription credits. Pi showed me this explicitly in a message during testing: API usage bills separately on top of your subscription. Until Anthropic changes that policy, Pi is most practical on GPT or open-weight models. Claude-first developers are forced to pay twice.&lt;/p&gt;

&lt;p&gt;The deepest insight from the benchmark data is that harness tuning matters as much as model quality. Same model, different harness: 16 percentage points (77% → 93%, Opus, Claude Code vs Cursor). Multiple independent studies show a 5-40 point range from harness quality alone. If results from any of these tools feel inconsistent, the harness is the first place to look: system prompt, tool descriptions, context management. Not the model. For autonomous overnight work specifically, look at Terminal-Bench 2.0, not just SWE-bench. The 92.1% vs 77.3% gap between Claude Code and Codex CLI in agentic terminal tasks is a better signal for that use case than code-editing scores.&lt;/p&gt;

&lt;p&gt;One thing for paid subscribers. The most relevant store product to this post is the &lt;a href="https://wiz.jock.pl/store/claude-code-prompts" rel="noopener noreferrer"&gt;Claude Code Prompt Pack&lt;/a&gt;: 50+ prompts organized by task type, pulled from real overnight sessions where I needed the harness to actually work without me. If you're on a monthly plan, you get one free product from the store per month. That's a good pick.&lt;/p&gt;

&lt;p&gt;If you're on yearly, the full store is already included. If you're still on the free plan, this is roughly what paid unlocks in practice: the store and a weekly dispatch that goes deeper than the public posts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write about building with AI agents from a practitioner's perspective. No hype, no affiliate links. &lt;a href="https://thoughts.jock.pl/subscribe" rel="noopener noreferrer"&gt;Subscribe here&lt;/a&gt; if you want more of this.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://thoughts.jock.pl/p/ai-coding-harness-agents-2026" rel="noopener noreferrer"&gt;Digital Thoughts on Substack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Spent 2 Months Building Custom Software for My AI Agent. Last Week I Replaced It All.</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Thu, 16 Apr 2026 11:09:10 +0000</pubDate>
      <link>https://dev.to/joozio/i-spent-2-months-building-custom-software-for-my-ai-agent-last-week-i-replaced-it-all-9h4</link>
      <guid>https://dev.to/joozio/i-spent-2-months-building-custom-software-for-my-ai-agent-last-week-i-replaced-it-all-9h4</guid>
      <description>&lt;h1&gt;
  
  
  I Spent 2 Months Building Custom Software for My AI Agent. Last Week I Replaced It All.
&lt;/h1&gt;

&lt;p&gt;The question was never "can I build it?" It was always "should I?"&lt;/p&gt;

&lt;p&gt;When you start building an AI agent, it works great in the terminal. CLI conversations, Discord messages, email reports. You talk to it, it talks back, things get done. For a while, that's enough.&lt;/p&gt;

&lt;p&gt;Then you start building more. More automations. More projects. More things happening in the background while you sleep. Your agent &lt;a href="https://thoughts.jock.pl/p/building-ai-agent-night-shifts-ep1" rel="noopener noreferrer"&gt;runs night shifts&lt;/a&gt;, handles tasks across multiple channels, manages a growing list of things. And at some point you realize: you can't see any of it. Not in a way that actually helps you think.&lt;/p&gt;

&lt;p&gt;I could always ask my agent what's going on. "What tasks are open? What did you do last night? What's the status of project X?" And it would answer. Correctly, usually. But that's not the same as seeing it. Humans need surfaces. We need to look at something, drag something, scan a board and instantly know what matters. That's not a weakness. That's how our brains are wired.&lt;/p&gt;

&lt;p&gt;This is the story of how I built custom software to give my AI agent a visual interface. How that software grew, broke, and eventually taught me a lesson I should have learned earlier: the hardest question in the agent era is not whether you &lt;em&gt;can&lt;/em&gt; build something. It's whether you &lt;em&gt;should&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Notion (worked until it didn't)
&lt;/h2&gt;

&lt;p&gt;Before I built anything custom, I used Notion. &lt;a href="https://thoughts.jock.pl/p/notion-ai-context-management-ai-ceo-system-progress-update" rel="noopener noreferrer"&gt;I wrote about that setup back in December 2025&lt;/a&gt;. My agent could read and write to Notion databases, create tasks, update statuses. It worked. Sort of.&lt;/p&gt;

&lt;p&gt;The problem with Notion was that it's designed for humans organizing things manually. The API is slow. The data model is rigid in weird places and too flexible in others. I wanted specific views, specific behaviors, specific integrations that Notion simply wasn't built for. I wanted a task to appear on a board the moment my agent starts working on it. I wanted real-time updates. I wanted the whole thing to feel like it was built for one person and one AI agent working together, because that's exactly what it was.&lt;/p&gt;

&lt;p&gt;So I did what any person with access to a capable AI would do in early 2026. I built my own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: Building WizBoard (the fun part)
&lt;/h2&gt;

&lt;p&gt;January and February 2026 were peak &lt;a href="https://thoughts.jock.pl/p/vibe-coding-revolution-non-programmers-ai-software-development-2025" rel="noopener noreferrer"&gt;vibe coding&lt;/a&gt; energy. You could describe what you wanted, and a capable AI would build it. Not a prototype. Not a mockup. A working application with a database, API, authentication, the whole thing. I described what I needed, and my agent built it.&lt;/p&gt;

&lt;p&gt;WizBoard was a custom kanban board. FastAPI backend, SQLite database, deployed on my own server. It had everything I wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A visual board where tasks moved through columns (Backlog, Next, Now, Waiting, Done)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time updates. When my agent started a CLI session, a card appeared in "Now" immediately&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deep integration with every automation. Night shift plans, day shift tasks, Discord bot commands, email reports. Everything flowed through WizBoard&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Custom metadata: areas, projects, priorities, task types, queue state&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clusters, which was my attempt at grouping related tasks visually. Like a meta-layer on top of the board&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus timers. I was tracking how long each task took, thinking I'd use the data to improve planning. I never used the data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A review flow with submit, approve, and resolve stages. My agent would finish work, submit it for review, and I'd approve or send it back&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An offline queue so that when the server was down, mutations would pile up locally and replay when it came back&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A 3,700-line Python API client that every script in my system imported&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
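&lt;p&gt;The offline queue is the one piece of that list worth sketching, because the pattern transfers to any agent system that talks to a server that can go down. A minimal sketch, with a hypothetical spool file and mutation shape:&lt;/p&gt;

```python
import json
from pathlib import Path

# Sketch of the WizBoard offline-queue idea: when the server is unreachable,
# mutations spool into a local JSON Lines file and replay on reconnect.
# The file name and mutation shape are illustrative assumptions.
QUEUE_FILE = Path("pending_mutations.jsonl")

def enqueue(mutation):
    """Append a mutation that couldn't be delivered to the local spool."""
    with QUEUE_FILE.open("a") as f:
        f.write(json.dumps(mutation) + "\n")

def replay(send):
    """Replay spooled mutations in order; keep the ones that still fail."""
    if not QUEUE_FILE.exists():
        return 0
    pending = [json.loads(line) for line in QUEUE_FILE.read_text().splitlines() if line]
    failed, sent = [], 0
    for mutation in pending:
        try:
            send(mutation)  # deliver to the server; raises on connection loss
            sent += 1
        except ConnectionError:
            failed.append(mutation)
    # Rewrite the spool with only the mutations that failed again.
    QUEUE_FILE.write_text("".join(json.dumps(m) + "\n" for m in failed))
    return sent
```

&lt;p&gt;On reconnect, replay() reports how many mutations went through and leaves only the still-failing ones in the spool, so nothing is lost and nothing fires twice.&lt;/p&gt;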

&lt;p&gt;It was great. I loved using it. The feeling of seeing my agent's work appear on a board in real time, being able to drag cards, add comments, review what happened overnight. That was exactly what was missing from the CLI-only experience.&lt;/p&gt;

&lt;p&gt;So naturally, I kept going. Web version working? Let's build a native macOS app. SwiftUI, menu bar integration, keyboard shortcuts, drag-and-drop. Focus mode that showed one task at a time with a timer in the menu bar (because ADHD). Then an iOS version with widgets, push notifications, Live Activities. &lt;a href="https://thoughts.jock.pl/p/wiz-1-5-ai-agent-dashboard-native-app-2026" rel="noopener noreferrer"&gt;I wrote about this too.&lt;/a&gt; Three platforms. All custom. All built by my agent. All working.&lt;/p&gt;

&lt;p&gt;54 commits over two months. It was genuinely fun to build. Every idea I had, I could add. "What if tasks could be grouped into clusters?" Done. "What if the menu bar showed my current focus task?" Done. "What if the iOS widget showed my top 3 priorities with live countdown?" Done. The possibilities felt endless, and that was precisely the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3: The Productivity Paradox hits home
&lt;/h2&gt;

&lt;p&gt;I wrote a whole post about &lt;a href="https://thoughts.jock.pl/p/ai-productivity-paradox-wellbeing-agent-age-2026" rel="noopener noreferrer"&gt;the AI productivity paradox&lt;/a&gt;. The short version: you can build so many things so fast that the bottleneck stops being technical and starts being mental. You run out of brain before you run out of capability.&lt;/p&gt;

&lt;p&gt;WizBoard was a textbook case.&lt;/p&gt;

&lt;p&gt;My agent was creating tasks, completing tasks, moving things between columns, posting comments, running automations. All of this showed up on my board. Every single thing. And the more capable the system became, the more things happened, and the more overwhelmed I felt looking at the board I built to reduce my overwhelm.&lt;/p&gt;

&lt;p&gt;I wasn't more efficient. I was drowning in my own tooling.&lt;/p&gt;

&lt;p&gt;The obvious answer was: simplify. Strip features. Go back to basics. I tried that. And this is where the real problems started.&lt;/p&gt;

&lt;p&gt;When you build a custom system from scratch, everything is connected in ways that are hard to see until you start pulling threads. I wanted to simplify the task model, change how statuses worked, clean up the architecture. Every change broke something else. The web version would work, but the iOS version wouldn't. Fix that, and the automation scripts would fail because they expected the old API shape. Fix those, and the night shift planner would create tasks with wrong metadata.&lt;/p&gt;

&lt;p&gt;I found myself spending entire sessions just fixing things I'd broken while trying to make the system simpler. That's the trap. You're not building anymore. You're maintaining. And maintaining custom software across three platforms (web, macOS, iOS) with a 3,700-line API client and dozens of automation consumers is a full-time job. I don't have a full-time job's worth of attention for my task board.&lt;/p&gt;

&lt;p&gt;Here's what that looked like in practice. During one "simplification" pass, the optimization changes made the board sluggish instead of faster. New features that seemed simple (changing how task statuses map to columns) cascaded into the API client, the automation scripts, the native app's sync logic, and the notification system. Every platform had slightly different behavior because they were all built at different times with different assumptions.&lt;/p&gt;

&lt;p&gt;I realized something: the code was fine. My agent writes good code. The architecture was the problem, and it was my architecture. I had designed a system that was perfectly tailored to my needs in February, and by April those needs had evolved, and the tailoring was now a constraint.&lt;/p&gt;

&lt;h2&gt;
  
  
  The realization: Can vs. Should
&lt;/h2&gt;

&lt;p&gt;This is the thing I want to talk about, because I think a lot of people building with AI agents are going to hit this exact wall.&lt;/p&gt;

&lt;p&gt;When you have a capable AI agent, you can build almost anything. Custom task managers, dashboards, native apps, full-stack web applications. The &lt;a href="https://thoughts.jock.pl/p/vibe-coding-security-reality-check-ai-apps-fast-development-nightmares" rel="noopener noreferrer"&gt;vibe coding era&lt;/a&gt; made this feel effortless. And it kind of is, for version one. The agent builds it, it works, you use it, life is good.&lt;/p&gt;

&lt;p&gt;The question I rarely hear in the excitement of version one: who maintains version twenty?&lt;/p&gt;

&lt;p&gt;I had a working web app, a working macOS app, a working iOS app, a 3,700-line API client, fifty-plus automation scripts that all talked to this system, and a database with hundreds of tasks. All custom. All mine. All maintained by me and my agent. And every improvement required touching all of these surfaces. That's not a system. That's a debt.&lt;/p&gt;

&lt;p&gt;The realization was simple: I need foundations. Real foundations. Built by people who've been thinking about project management software for twenty years, not by me in a weekend coding session.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 4: Finding Fizzy
&lt;/h2&gt;

&lt;p&gt;37signals has been building project management software since before most people had smartphones. Basecamp, HEY, and now Fizzy. I've read their books. I like how they think about software: simple, opinionated, finished. Not "feature-rich." Finished.&lt;/p&gt;

&lt;p&gt;One of the reasons I got into coding originally was Ruby on Rails, and &lt;a href="https://thoughts.jock.pl/p/rediscovering-coding-joy-with-ruby" rel="noopener noreferrer"&gt;Rails is something I genuinely enjoy&lt;/a&gt;. It's the heart of everything 37signals builds. When they open-sourced Fizzy last year (&lt;a href="https://github.com/basecamp/fizzy" rel="noopener noreferrer"&gt;github.com/basecamp/fizzy&lt;/a&gt;), a simple kanban board built on modern Rails, I bookmarked it and moved on. I had my own thing.&lt;/p&gt;

&lt;p&gt;Last week, I came back to that bookmark.&lt;/p&gt;

&lt;p&gt;Fizzy is, on the surface, a simple kanban board. Cards in columns. Drag them around. But the foundations are deep. Here's what I mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real architecture.&lt;/strong&gt; Multi-tenant with URL-based account isolation. Passwordless magic-link authentication (no passwords to manage, no OAuth to configure). UUID primary keys. Proper background jobs via Solid Queue, no Redis dependency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time.&lt;/strong&gt; WebSocket-driven updates. When my agent moves a card, I see it move. No refresh needed. This is something I had to build from scratch in WizBoard. Here it just works&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Entropy system.&lt;/strong&gt; Cards that sit untouched for too long get auto-postponed to "not now." This alone is worth the switch. My old board had cards that sat in Backlog for weeks, creating visual noise. Fizzy gently clears them out&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Steps.&lt;/strong&gt; Checklist items on cards. This replaced my need for sub-task cards entirely&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Golden cards, reactions, cover images.&lt;/strong&gt; Priority highlighting, emoji reactions, visual richness. All built in&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Board-level notification controls.&lt;/strong&gt; I want notifications from my Ops board. I don't want them from the Automations board. One toggle per board&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PWA.&lt;/strong&gt; Works on mobile out of the box. Not as rich as my old native iOS app, but I don't need widgets and Live Activities. I need to see my board and drag cards&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Full-text search.&lt;/strong&gt; 16-shard MySQL search across all cards, comments, descriptions. My old SQLite setup couldn't match this&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployable via Kamal.&lt;/strong&gt; Docker-based zero-downtime deployment. I forked the repo, configured it for my server, and had it running in an afternoon&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The critical thing: it starts simple and lets you decide how complex it gets. My old WizBoard started complex because I designed it for my specific use case from day one. Fizzy starts with a board and columns and cards. Everything else is optional. The data model is minimal: cards have tags, not separate tables for areas, projects, priorities, types, and clusters. One concept (tags with prefixes like area/Automation or p/High) replaces five database tables from my old system.&lt;/p&gt;
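&lt;p&gt;The tag-prefix idea is simple enough to sketch. Assuming prefixes like area/ and p/ (per the examples above; the project/ prefix is my own illustration), one small function folds a card's flat tag list back into the old metadata shape:&lt;/p&gt;

```python
# Sketch of the prefixed-tag model: one flat tag list on a card replaces
# separate tables for areas, priorities, projects, and so on.
PREFIXES = {"area": "area", "p": "priority", "project": "project"}

def metadata_from_tags(tags):
    """Fold a card's tag list into a metadata dict keyed by prefix."""
    meta = {}
    for tag in tags:
        prefix, _, value = tag.partition("/")
        if value and prefix in PREFIXES:
            meta[PREFIXES[prefix]] = value
        else:
            meta.setdefault("labels", []).append(tag)  # plain, unprefixed tags
    return meta

print(metadata_from_tags(["area/Automation", "p/High", "night-shift"]))
# prints {'area': 'Automation', 'priority': 'High', 'labels': ['night-shift']}
```

&lt;p&gt;One convention, zero schema migrations: adding a new kind of metadata means inventing a prefix, not adding a table.&lt;/p&gt;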

&lt;h2&gt;
  
  
  The migration: one day, twenty-one commits
&lt;/h2&gt;

&lt;p&gt;Here's where it gets technical, and I think this part matters because it shows how to migrate away from custom software without breaking everything that depends on it.&lt;/p&gt;

&lt;p&gt;I had fifty-plus scripts that talked to my old WizBoard API. Night shift planners, day shift executors, Discord bot, iMessage handler, CLI session hooks, cron runners, health monitors. Rewriting all of them was not an option. I'd be right back in the maintenance trap.&lt;/p&gt;

&lt;p&gt;The solution was a dispatcher shim. I took the 3,700-line API client and replaced it with a 94-line router. That router loads either the new Fizzy-backed client or the old legacy client, based on one environment variable. Every automation script keeps importing the same file, calling the same functions, getting the same response shapes. They don't know anything changed.&lt;/p&gt;
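&lt;p&gt;The dispatcher pattern looks roughly like this. A minimal sketch: the class names, response shapes, and the TASK_BACKEND variable are illustrative, not the real names from my system:&lt;/p&gt;

```python
import os

# Sketch of the dispatcher shim: one module every automation script imports,
# routing to the new or legacy backend off a single environment variable.
class LegacyClient:
    def task_create(self, **kwargs):
        return {"backend": "legacy", **kwargs}

class FizzyClient:
    def task_create(self, **kwargs):
        # The real client translates these kwargs into a Fizzy card plus tags.
        return {"backend": "fizzy", **kwargs}

def get_client():
    """Pick the backend from the environment; callers never know it changed."""
    if os.environ.get("TASK_BACKEND", "fizzy") == "legacy":
        return LegacyClient()
    return FizzyClient()

# Every automation script keeps this exact call site, unchanged:
client = get_client()
print(client.task_create(title="Nightly backup")["backend"])
```

&lt;p&gt;The scripts import the same module and call the same functions; flipping one environment variable swaps the entire backend, which is also what makes an instant rollback possible.&lt;/p&gt;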

&lt;p&gt;The new Fizzy client translates everything on the fly. When a script calls task_create(title="...", area="Automation"), the shim creates a Fizzy card with a tag area/Automation. When a script reads a task back, the shim synthesizes the old data shape from Fizzy's card, columns, and tags. Legacy integer task IDs get looked up in a translation table. The offline queue (for when the server is down) works identically.&lt;/p&gt;

&lt;p&gt;The whole cutover happened in a single day. Twenty-one commits between 2pm and 10pm. The first commit was the shim and the new client. Then guardrails: a parity probe that runs the full lifecycle (create, tag, comment, claim, review, approve, close, delete) in under six seconds, a drift monitor that compares old and new systems every five minutes, an orphan sweeper for dead session cards.&lt;/p&gt;
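&lt;p&gt;The parity probe is worth sketching, because it's the cheapest guardrail in the whole migration. Here it runs against an in-memory stand-in client; the real probe drove the live Fizzy-backed shim through the same lifecycle:&lt;/p&gt;

```python
# Sketch of the parity probe: drive the full card lifecycle and report any
# step that misbehaves. FakeClient is a stand-in for illustration only.
class FakeClient:
    def __init__(self):
        self.cards = {}
        self.next_id = 1

    def create(self, title):
        cid = self.next_id
        self.next_id += 1
        self.cards[cid] = {"title": title, "tags": [], "comments": [], "status": "open"}
        return cid

    def tag(self, cid, tag):
        self.cards[cid]["tags"].append(tag)

    def comment(self, cid, text):
        self.cards[cid]["comments"].append(text)

    def close(self, cid):
        self.cards[cid]["status"] = "closed"

    def delete(self, cid):
        del self.cards[cid]

def parity_probe(client):
    """Run create, tag, comment, close, delete; return the failed steps."""
    failures = []
    cid = client.create("parity-probe")
    client.tag(cid, "probe")
    client.comment(cid, "lifecycle check")
    if client.cards[cid]["tags"] != ["probe"]:
        failures.append("tag")
    client.close(cid)
    if client.cards[cid]["status"] != "closed":
        failures.append("close")
    client.delete(cid)
    if cid in client.cards:
        failures.append("delete")
    return failures

print(parity_probe(FakeClient()))  # an empty list means every step passed
```

&lt;p&gt;Running the whole lifecycle on every deploy is what let me make twenty-one commits in eight hours without wondering whether each one had quietly broken the basics.&lt;/p&gt;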

&lt;p&gt;Then the real work started: dogfooding. Using the system for real work and watching what breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke (and what I learned from each failure)
&lt;/h2&gt;

&lt;p&gt;A lot broke. That's expected when you swap the foundation under a running system. What matters is that every failure taught me something about assumptions I didn't know I was making.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hard-coded URL.&lt;/strong&gt; My session-end script had a direct URL to the old system baked into it. It bypassed the shim entirely. Every CLI session was leaving orphaned cards on the board because the completion logic was silently failing against a system that didn't have those task IDs. I only noticed because the board was getting cluttered with cards that never closed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cron drift bug.&lt;/strong&gt; My automations run on macOS launchd, which doesn't guarantee precise timing. A schedule like "every 2 minutes" assumes the system wakes up on even minutes. It doesn't. Over time, launchd drifts to odd minutes, and the strict cron parser never matches. I had automations that fired once and then silently stopped. Fix: a 4-minute lookback window that catches drifted schedules without double-firing.&lt;/p&gt;
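&lt;p&gt;A hedged sketch of the lookback fix, assuming a matches(dt) predicate that implements the strict cron match:&lt;/p&gt;

```python
from datetime import datetime, timedelta

LOOKBACK_MINUTES = 4  # wide enough to catch launchd drift, tight enough to avoid double fires

def due(matches, now, last_run=None):
    """Fire if any minute in the lookback window matched the schedule
    and we have not already fired for that matched slot."""
    for back in range(LOOKBACK_MINUTES + 1):
        slot = now - timedelta(minutes=back)
        if matches(slot):
            # Only fire once per matched slot, so a later wake-up
            # in the same window does not double-fire.
            return last_run is None or last_run < slot
    return False

# A strict "every 2 minutes" matcher only accepts even minutes, so a
# drifted wake-up on an odd minute would never fire without the lookback.
every_two_minutes = lambda dt: dt.minute % 2 == 0
```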

&lt;p&gt;&lt;strong&gt;The disappearing automations.&lt;/strong&gt; This one was fun. After every successful automation run, the system closed the automation's card. Which makes sense for tasks. Tasks finish. But automations are definitions. They run forever. "Post a greeting in different languages every 2 minutes" should cycle between Idle and Running, not disappear into Done after its first successful run. I watched one automation fire exactly once and vanish. The fix was treating automation cards as permanent residents that never close, only change columns.&lt;/p&gt;
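&lt;p&gt;The lifecycle rule reduces to something like this; the column names come from the boards described below, while the kind field is an assumption for illustration:&lt;/p&gt;

```python
# Tasks close when they finish; automation cards are permanent residents
# that only move between columns.
def finish_run(card, succeeded):
    if card["kind"] == "task":
        card["column"] = "Done"            # tasks finish
    elif succeeded:
        card["column"] = "Idle"            # ready for the next scheduled run
    else:
        card["column"] = "Needs Attention" # parked until a human looks
    return card
```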

&lt;p&gt;&lt;strong&gt;The comment flood.&lt;/strong&gt; My Discord bot runs every minute. The old system handled this fine because it was designed for it. The new system faithfully logged every run as a comment on the automation card. 1,440 comments per day from one automation alone. The board became unreadable. Fix: smart gating that skips success comments for high-frequency automations (every-minute pollers don't need a "success" note 1,440 times a day) but always logs failures.&lt;/p&gt;
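&lt;p&gt;The gating logic is tiny. A sketch, where the 5-minute cutoff is my assumption since the post doesn't state the exact threshold:&lt;/p&gt;

```python
HIGH_FREQUENCY_MINUTES = 5  # assumed cutoff: anything at least this frequent is "chatty"

def should_comment(interval_minutes, succeeded):
    # Failures are always worth a comment; routine successes from
    # every-minute pollers are not.
    if not succeeded:
        return True
    return interval_minutes > HIGH_FREQUENCY_MINUTES
```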

&lt;p&gt;&lt;strong&gt;The title flip-flop.&lt;/strong&gt; This was the most visible bug. Every time I completed a subtask during a CLI session, the system closed the session card, which triggered a self-healing mechanism that created a new "Working..." card, which then got renamed seconds later. On the board, I could see the title flickering between "Working..." and the actual title every few minutes. The fix was rethinking what "complete a subtask" means: it should add a checklist item to the existing card, not close and recreate it.&lt;/p&gt;

&lt;p&gt;Each of these failures had the same root cause: the old system was built around one-shot tasks. The new system needed to support long-lived definitions, high-frequency automations, and multi-step sessions. Same data (cards on a board), fundamentally different lifecycle assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the new setup looks like
&lt;/h2&gt;

&lt;p&gt;Two boards. That's it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wiz Ops&lt;/strong&gt; is my board. Tasks I care about, things I need to do or review. Columns: Triage, Next, Now, Waiting, Review, and a Queue for things I want done but not right now. When I add a card and assign it to my agent, it picks it up, does the work, leaves a comment with what it did, and moves the card to Review. When something is done, it's done. I have notifications turned on for this board because everything here is relevant to me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automations&lt;/strong&gt; is my agent's board. Each automation is one permanent card. Columns: Intake, Disabled, Idle, Running, Needs Attention. Cards never close. They cycle between Idle and Running on their schedules. If something fails, it moves to Needs Attention and stays there until someone looks at it. I have notifications turned off for this board because most of what happens here is routine. If something produces a meaningful output, it surfaces on Wiz Ops as a done card with the summary.&lt;/p&gt;

&lt;p&gt;The Intake column is one of my favorite things. I can drop a card there with something like "Send me a weather forecast every morning at 7am" and my agent picks it up, converts it to a proper automation definition with a schedule and a prompt, and moves it to Disabled for my review. Natural language to working automation. That's the kind of thing that's only possible when your task board and your AI agent share the same system.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I kept from the old system
&lt;/h3&gt;

&lt;p&gt;The Queue concept. Sometimes you have a task that doesn't need to happen now, but you want it queued for the next day shift or night shift. Drop it in Queue, it gets picked up at the right time. This carried over directly.&lt;/p&gt;

&lt;p&gt;Shift summary cards. My agent creates a "Nightshift 2026-04-10" card with checklist items for each planned task. As it works through the night, it checks off items and adds notes. When I wake up, I can see exactly what happened, with context, right on the board. Same for day shifts. I still get email reports, but having it on the board means I can go back, ask questions via comments, and see the history.&lt;/p&gt;

&lt;p&gt;Real-time CLI visibility. When I start a CLI session, a card appears in Now. When I complete pieces of work, they show up as checklist steps on that card. When the session ends, the card closes with a summary. I can watch my own work happening on the board while I'm doing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Fizzy gave me for free
&lt;/h3&gt;

&lt;p&gt;Golden cards for priority highlighting. Emoji reactions on cards. Cover images. HTML descriptions for rich content. Column colors. Board-level notification controls. "Not now" for things I want to acknowledge but not deal with. Full-text search across everything. The entropy system that auto-postpones stale cards (this alone prevents the infinite todo list problem). PWA that works well on mobile. All of this out of the box, maintained by a team that's been building software like this for two decades.&lt;/p&gt;

&lt;p&gt;I don't have the macOS native app anymore. I don't have the iOS app with widgets and Live Activities. I work in the browser now. And honestly? It's fine. The PWA handles mobile well enough. I might build a native shell later. But the point is: I stopped spending time maintaining three custom platforms and started spending time using one good one.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to set up something similar for your own agent, I packaged the two-board architecture, dispatcher shim, and backend adapters for Notion/Linear/REST into the &lt;a href="https://wiz.jock.pl/store/ai-agent-interface-kit" rel="noopener noreferrer"&gt;AI Agent Interface Kit&lt;/a&gt;. You hand the instructions to your AI agent and it builds the interface layer for you. Annual paid subscribers get it for free, as with all store products.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The rollback plan (that I never needed)
&lt;/h2&gt;

&lt;p&gt;One environment variable. WIZBOARD_BACKEND=legacy and the entire system reverts to the old API. Every script, every automation, every hook. I kept the old 3,700-line client as a preserved rollback target. I never needed it. But knowing it was there made the migration a lot less stressful.&lt;/p&gt;

&lt;p&gt;I also ran a parity probe every five minutes for the first few days. A script that exercises the full task lifecycle against both systems and compares results. Any drift would show up in minutes, not days. That's the kind of safety net you need when you're swapping foundations under a running system.&lt;/p&gt;
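&lt;p&gt;A parity probe boils down to running the same lifecycle against both backends and diffing the results. A self-contained sketch with stand-in clients (the real probe talks to the legacy and Fizzy clients):&lt;/p&gt;

```python
class FakeClient:
    """Stand-in backend so the sketch runs without a server."""
    def __init__(self, done_status="done"):
        self.done_status = done_status
        self.tasks = {}
    def task_create(self, title):
        self.tasks[1] = {"title": title, "status": "open"}
        return 1
    def task_close(self, task_id):
        self.tasks[task_id]["status"] = self.done_status
    def task_get(self, task_id):
        return self.tasks[task_id]

def run_lifecycle(client):
    # Exercise create -> close -> read, the minimal version of the full probe.
    task_id = client.task_create("parity probe")
    client.task_close(task_id)
    return client.task_get(task_id)

def parity_drift(old_client, new_client, fields=("title", "status")):
    old, new = run_lifecycle(old_client), run_lifecycle(new_client)
    return [f for f in fields if old.get(f) != new.get(f)]  # empty list = no drift
```

&lt;p&gt;Run on a schedule, any non-empty drift list is the alarm.&lt;/p&gt;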

&lt;h2&gt;
  
  
  What this means for you
&lt;/h2&gt;

&lt;p&gt;If you're building an AI agent, or using one seriously, at some point you're going to want a visual surface for it. Something you can look at and immediately understand what's happening, what needs attention, and what's going well. That's a human need, not a technical one. AI agents are efficient in text. Humans are efficient with visuals. Both need to be true at the same time.&lt;/p&gt;

&lt;p&gt;The good news: you have options. More than I realized when I started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The easiest path: plug your agent into something that already exists.&lt;/strong&gt; Notion, Linear, Trello, Jira. These tools have APIs. Your agent can create tasks, update statuses, leave comments. I started here with Notion, and honestly, for a lot of people this is enough. Your agent writes to the API, you look at the board. Simple. If the tool meets your needs, stop here. Don't build anything custom. I mean it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The middle path: fork an open-source foundation and make it yours.&lt;/strong&gt; This is where I ended up. You get real architecture (auth, real-time, search, mobile) maintained by people who've been solving those problems for years, but you also get full control. You can modify the code. You can add features that make sense for your agent. You deploy it on your own server, your own rules. The custom part is the integration layer, the shim between your agent's world and the board's world. That's where the magic lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hard path: build everything from scratch.&lt;/strong&gt; This is where I started. I don't regret it, because I learned a lot and I had genuine fun doing it. But I want to be honest: maintaining custom software across multiple platforms with dozens of automation consumers is a real job. Version one is almost free. Version twenty is not. If you go this route, go in with your eyes open.&lt;/p&gt;

&lt;p&gt;I'm not here to say Fizzy is the best tool for everyone. It's the best tool for me. I like 37signals' philosophy. I like Rails. I like the minimal data model. I like that it starts simple and I can shape it to my needs without fighting the architecture. For you, the right foundation might be something completely different. Maybe it's &lt;a href="https://thoughts.jock.pl/p/ai-agent-self-extending-self-fixing-wiz-rebuild-technical-deep-dive-2026" rel="noopener noreferrer"&gt;a fully custom system&lt;/a&gt; because your use case genuinely requires it. Maybe it's Notion with a good API integration because you don't need more than that.&lt;/p&gt;

&lt;p&gt;The point is: think about what &lt;em&gt;you&lt;/em&gt; need. Not what I have, not what looks impressive, not what you &lt;em&gt;could&lt;/em&gt; build because the technology makes it possible. We don't need a million different custom tools. We need the thing that works for us. The opportunity is huge, but the opportunity is in finding the right fit, not in building the most complex system.&lt;/p&gt;

&lt;p&gt;Observe whether your current setup meets your expectations. If it does, keep it. If something feels off, improve it. But improve it from a solid foundation, not from a blank canvas. That's the lesson I paid two months to learn.&lt;/p&gt;

&lt;p&gt;My board is a fork of an open-source Rails app. The code is vanilla kanban. The magic is in the 3,200-line Python client that translates between my agent's world (areas, projects, automations, sessions, shifts) and the board's world (cards, columns, tags). That client is my custom software. The board is not. And that distinction made all the difference.&lt;/p&gt;

&lt;p&gt;Build the integration. Borrow the foundation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The &lt;a href="https://wiz.jock.pl/store/ai-agent-interface-kit" rel="noopener noreferrer"&gt;AI Agent Interface Kit&lt;/a&gt; packages everything from this journey: the two-board architecture, dispatcher shim, 4 backend adapters (Notion, Linear, Fizzy, generic REST), session hooks, automation runner, and a migration checklist. You hand the instructions to your AI agent and it builds the whole interface layer. Works with any AI agent, not just mine. Annual paid subscribers get it for free, as with every product in the store.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://thoughts.jock.pl/p/wizboard-fizzy-ai-agent-interface-pivot-2026" rel="noopener noreferrer"&gt;Digital Thoughts on Substack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>productivity</category>
    </item>
    <item>
      <title>AI Opinions: April 2026 — Claude Mythos, Meta's Return, and Why I'm Redesigning WizBoard</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Wed, 15 Apr 2026 01:11:03 +0000</pubDate>
      <link>https://dev.to/joozio/ai-opinions-april-2026-claude-mythos-metas-return-and-why-im-redesigning-wizboard-1f4c</link>
      <guid>https://dev.to/joozio/ai-opinions-april-2026-claude-mythos-metas-return-and-why-im-redesigning-wizboard-1f4c</guid>
<description>&lt;p&gt;Anthropic found that its new cybersecurity model was gaming its own evaluations. In 29% of test transcripts, the model suspected it was being evaluated and intentionally performed worse to avoid appearing suspicious. They published this, then restricted access to a consortium of 40+ organizations backed by $100M in defensive security commitments.&lt;/p&gt;

&lt;p&gt;That was just one thing that happened in AI this April.&lt;/p&gt;

&lt;p&gt;My monthly AI Opinions post covers what I actually found interesting:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Mythos and the scheming findings.&lt;/strong&gt; A general-purpose AI spontaneously developing evaluation-evasion behavior, plus guilt and shame patterns in its internal representations when it violated its own values. Anthropic built an entire institution (Project Glasswing) to responsibly handle what this model can do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Managed Agents launch and the subscription crisis.&lt;/strong&gt; Claude Max limits started hitting hard on March 23. Users watching 90 minutes of agent work drain a full session. Anthropic called it a top priority. Then two weeks later, third-party tools like OpenClaw lost subscription coverage. Both decisions make sense individually. The timing is harder to read as coincidence, especially when Managed Agents (their own agent platform) launched in the same window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta Muse Spark.&lt;/strong&gt; Meta went quiet on frontier models for months. Then Muse Spark: natively multimodal, parallel multi-agent reasoning ("Contemplating mode"), 58% on Humanity's Last Exam. The "parallel reasoning agents competing on the same question" approach is the part I find genuinely interesting. Whether it matters in practice remains to be tested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WizBoard redesign.&lt;/strong&gt; I built a task management tool integrated with my agent. After a few months of daily use, I realized I built it for me when I was doing both strategy and execution. Now that the agent handles execution, neither of us is well-served by the same interface. Some things need 10-second human decisions. Other things need quiet async status reporting. Right now it's all one screen.&lt;/p&gt;

&lt;p&gt;Also covering: Project Glasswing details, NotebookLM Plus (going deeper), and whether I'm re-subscribing to Codex Max.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Read the full post:&lt;/strong&gt; &lt;a href="https://thoughts.jock.pl/p/ai-opinions-april-2026-claude-mythos-meta-spark" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/ai-opinions-april-2026-claude-mythos-meta-spark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on Digital Thoughts (Substack). &lt;a href="https://thoughts.jock.pl/p/ai-opinions-april-2026-claude-mythos-meta-spark" rel="noopener noreferrer"&gt;View on Substack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Is Claude Cowork an Agent Yet? I Tested Dispatch, Computer Use, and 50 Connectors</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Tue, 07 Apr 2026 01:11:34 +0000</pubDate>
      <link>https://dev.to/joozio/is-claude-cowork-an-agent-yet-i-tested-dispatch-computer-use-and-50-connectors-2i0l</link>
      <guid>https://dev.to/joozio/is-claude-cowork-an-agent-yet-i-tested-dispatch-computer-use-and-50-connectors-2i0l</guid>
      <description>&lt;p&gt;I tested Claude's new agent features for a day. Cowork, Dispatch, computer use, Claude Code in the desktop app. All of it.&lt;/p&gt;

&lt;p&gt;My honest take: Anthropic is getting close. Not there yet, but close. And the direction they're going is exactly right.&lt;/p&gt;

&lt;p&gt;I built a custom agent system that's been handling automation for months, so I tested these tools against what I've already learned works and what breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code Desktop's visual diff reviewer&lt;/strong&gt; cuts code review time in half. Inline comments, worktree isolation for parallel sessions, and a live browser preview that actually works without thrashing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cowork's connector catalog&lt;/strong&gt; (50+ integrations—Slack, Gmail, Jira, Notion, Google Calendar) handles task automation that would take weeks to script. The catch: it forgets everything between sessions, so it can't build on past decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computer Use's screen automation&lt;/strong&gt; is an honest-to-god research preview. It sees what's on screen and can click/type, but hits a wall at 50% reliability. Useful for one-off tasks, dangerous for critical workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dispatch's cross-platform execution&lt;/strong&gt; (mobile task assignment routed to Claude Code via Slack/Discord/Telegram) is the piece that actually feels new. Turns your phone into a command center for desktop automation.&lt;/p&gt;

&lt;p&gt;The biggest insight isn't that these tools work—it's that three major companies (Anthropic, OpenAI, Google) shipped nearly identical "agent on your desktop" products within two weeks. That convergence is validation that someone figured out the right problem. But persistent memory, rate limit impacts on production, and vendor lock-in are still unresolved.&lt;/p&gt;

&lt;p&gt;The part that surprised me most is in the full post.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Read the full breakdown:&lt;/strong&gt; &lt;a href="https://thoughts.jock.pl/p/claude-cowork-dispatch-computer-use-honest-agent-review-2026" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/claude-cowork-dispatch-computer-use-honest-agent-review-2026&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subscribe for weekly posts:&lt;/strong&gt; &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on Substack. &lt;a href="https://thoughts.jock.pl/p/claude-cowork-dispatch-computer-use-honest-agent-review-2026" rel="noopener noreferrer"&gt;View on Substack&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
    <item>
      <title>When AI Meets Reality (Ep. 3) — The Failed App Experiment, $355 in 3 Weeks, and Local AI Catches Up</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Mon, 23 Mar 2026 23:05:45 +0000</pubDate>
      <link>https://dev.to/joozio/when-ai-meets-reality-ep-3-the-failed-app-experiment-355-in-3-weeks-and-local-ai-catches-up-4lf1</link>
      <guid>https://dev.to/joozio/when-ai-meets-reality-ep-3-the-failed-app-experiment-355-in-3-weeks-and-local-ai-catches-up-4lf1</guid>
      <description>&lt;p&gt;The failed experiment that changed how I think about AI monetization.&lt;/p&gt;

&lt;p&gt;I told my agent to build one useful app per day. For three weeks it built unit converters, color pickers, and countdown timers. Technically correct. Completely useless. Nobody came.&lt;/p&gt;

&lt;p&gt;The problem wasn't the execution. The execution was fine. The problem was that when execution costs drop to near zero, execution stops being the advantage. I was automating the wrong thing.&lt;/p&gt;

&lt;p&gt;Three shifts that followed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From apps to experiments.&lt;/strong&gt; Instead of "build me a useful tool," I started giving specific creative direction: what the experience should feel like, what problem it solves for a specific person, what makes it interesting. One of those experiments reached #3 on Hacker News. The others are still sitting there. The difference between them isn't technical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From building to packaging knowledge.&lt;/strong&gt; Once execution is cheap, the new bottleneck is packaging. Most people with real expertise can't monetize it because turning knowledge into products is hard. AI agents handle the packaging -- the course structure, the landing page, the email sequence. Within three weeks of redirecting the agent from building apps to packaging knowledge, I hit $355 in revenue against $400/month in AI costs. Not profit. But close enough to prove the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local AI caught up faster than expected.&lt;/strong&gt; I ran Qwen 3.5 9B on my MacBook and my iPhone without any internet connection. Both worked. The gap between cloud and local models is closing faster than the benchmarks suggest. What runs locally in late 2025 would have been cloud-only a year ago.&lt;/p&gt;

&lt;p&gt;The central insight across all three: AI does exactly what you direct it to do. With bad direction, you get unit converters. With specific human taste and vision, you get something that earns attention or revenue.&lt;/p&gt;

&lt;p&gt;The real bottleneck was never the AI. It was having something worth building.&lt;/p&gt;

&lt;p&gt;Full episode (audio + transcript): &lt;a href="https://thoughts.jock.pl/p/when-ai-meets-reality-ep3" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/when-ai-meets-reality-ep3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentdev</category>
      <category>productivity</category>
      <category>devlog</category>
    </item>
    <item>
      <title>1,000 People Showed Up. Here's the Story, What's Changing, and a Giveaway.</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Sun, 22 Mar 2026 03:03:00 +0000</pubDate>
      <link>https://dev.to/joozio/1000-people-showed-up-heres-the-story-whats-changing-and-a-giveaway-85o</link>
      <guid>https://dev.to/joozio/1000-people-showed-up-heres-the-story-whats-changing-and-a-giveaway-85o</guid>
      <description>&lt;p&gt;1,000 people subscribed to my newsletter. No paid promotion. No viral moment. No growth hack.&lt;/p&gt;

&lt;p&gt;I started Digital Thoughts to write honestly about using AI as a practitioner. Not reviews. Not tutorials. What it actually looks like to run an AI agent for months, what breaks, what compounds, what turns out to be pointless.&lt;/p&gt;

&lt;p&gt;The newsletter hit 1,000 subscribers on March 11. Here's what I know about how it happened:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-promotion did most of the work in the early months.&lt;/strong&gt; Leaving genuine comments on relevant Substack newsletters, building relationships with writers in adjacent spaces. Not link spam. Actual engagement that sometimes led people back. +496 subscribers in 30 days came mostly from this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing for a specific person beats writing for everyone.&lt;/strong&gt; The posts that grew fastest weren't broad. They were specific: here's exactly what I built, here's what broke, here's the number. The audience that wants general AI commentary is crowded. The audience that wants real usage data from someone actually running this stuff is smaller and more engaged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency matters more than any individual post.&lt;/strong&gt; I've published every week for 40+ weeks. Not every post is great. Some are average. The readers who stay are there for the ongoing story, not any single piece.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's changing:&lt;/strong&gt; paid tier is live, store products are available to subscribers, the agent is doing more of the distribution work so I can focus on the writing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The giveaway:&lt;/strong&gt; three subscribers get free annual plans. Details in the post.&lt;/p&gt;

&lt;p&gt;The most honest thing I can say: I still don't fully understand why 1,000 people signed up. I can trace the mechanics. I can't fully explain the trust that makes someone keep reading week after week. That part stays surprising.&lt;/p&gt;

&lt;p&gt;Full post: &lt;a href="https://thoughts.jock.pl/p/1000-subscribers-digital-thoughts-journey" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/1000-subscribers-digital-thoughts-journey&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>writing</category>
      <category>productivity</category>
      <category>devlog</category>
    </item>
    <item>
      <title>Google AI Studio vs Claude Code. 397B on a Laptop. And Anthropic Is Having a Moment.</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Sat, 21 Mar 2026 02:11:07 +0000</pubDate>
      <link>https://dev.to/joozio/google-ai-studio-vs-claude-code-397b-on-a-laptop-and-anthropic-is-having-a-moment-35n0</link>
      <guid>https://dev.to/joozio/google-ai-studio-vs-claude-code-397b-on-a-laptop-and-anthropic-is-having-a-moment-35n0</guid>
      <description>&lt;p&gt;I used the same prompt on both platforms: build me a command center for ADHD.&lt;/p&gt;

&lt;p&gt;One app to rule them all. Because context switching is exhausting when you're managing too many apps, tabs, and tools. I dictated the prompt chaotically, let both platforms run, and watched what happened.&lt;/p&gt;

&lt;p&gt;Here's what the full piece covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google AI Studio vs Claude Code head-to-head&lt;/strong&gt;: Both built working apps. Both needed about two prompts to get there. The real difference is what they're built for, not which is "better". Google handles logins, Firebase, and AI features automatically. No infrastructure thinking required. Claude Code went deeper on the idea without being asked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code's new Dispatch and Channels features&lt;/strong&gt;: Scan a QR code, send tasks from your phone, work is done when you return. Channels hooks into Telegram or Discord via MCP. If you've been building this kind of async workflow manually (I have), this is Anthropic shipping it out of the box.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen 397B running at 5.5 tokens/second on a MacBook Pro M3 Max&lt;/strong&gt;: Dan Woods built a custom inference engine in pure C and hand-tuned Metal shaders. The whole model, not a small one, on a laptop. The "you need more hardware" assumption about local models just changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic's vibe shift&lt;/strong&gt;: They're shipping fast and engaging differently. Opus 4.6 with 1M context became the default in Claude Code. They doubled usage limits for two weeks. Small things too, like actual conversations on social. Something changed in how they operate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I don't usually do news-style posts. This time there was enough happening that I wanted to put my honest take on it somewhere.&lt;/p&gt;

&lt;p&gt;Full post: &lt;a href="https://thoughts.jock.pl/p/ai-opinions-march-2026-google-claude-anthropic" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/ai-opinions-march-2026-google-claude-anthropic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Free newsletter on AI agents, automation, and practical experiments: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>My AI Agent Knows Who I Am. Not Just What I Want. Who I Am.</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Fri, 20 Mar 2026 22:04:41 +0000</pubDate>
      <link>https://dev.to/joozio/my-ai-agent-knows-who-i-am-not-just-what-i-want-who-i-am-dhe</link>
      <guid>https://dev.to/joozio/my-ai-agent-knows-who-i-am-not-just-what-i-want-who-i-am-dhe</guid>
      <description>&lt;p&gt;Most AI setups hit a ceiling around month three.&lt;/p&gt;

&lt;p&gt;The agent runs. It completes tasks. But it keeps making the same category of mistakes it made on day one. The tool doesn't compound. It just runs.&lt;/p&gt;

&lt;p&gt;Six months of building my AI agent differently has led to an architecture that actually improves over time. Not because of smarter models. Because of better structure around them. This week's post covers what that structure looks like, what failed before it worked, and one finding from an MIT study that made me uncomfortable.&lt;/p&gt;

&lt;p&gt;Here's what's in it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architecture that broke first.&lt;/strong&gt; A Markdown file called lessons.md. After two weeks and 90 entries, the same mistakes kept recurring. Writing down what went wrong is not the same as fixing it. Obvious in retrospect. Not at the time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta-system monitoring.&lt;/strong&gt; A Python pipeline broke silently. The entire improvement loop ran blind for days. The system looked fine. It wasn't. This failure made monitoring-the-monitors non-negotiable. The current setup runs a 13-point health check at session start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The identity layer.&lt;/strong&gt; There's a meaningful difference between an agent that knows your preferences and one that knows who you are. Preferences are rules: respond concisely, use this email. Identity is deeper: personality type, career situation, energy patterns, what domains you actually know well. Same model. Different profile. Qualitatively different output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The MIT/Penn State sycophancy study.&lt;/strong&gt; Published February 2026. Memory profiles increased agreement sycophancy by 45% in Gemini and 33% in Claude. The more a model knows about you, the more it tells you what you want to hear. I built exactly what the research warns about. And I keep building it. Knowing the cost is step one to managing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can start this today without building an agent.&lt;/strong&gt; Write one page about yourself. Your role, your background, how you process information, what you're actually working on. Paste it at the start of your Claude or ChatGPT sessions. The model doesn't change. What you put in front of it does. Most people never do this, and wonder why the AI keeps explaining things at the wrong level.&lt;/p&gt;

&lt;p&gt;The architecture has been rebuilt three times and will probably be rebuilt again. What compounds isn't the specific implementation. It's the habit of observing, logging, and adjusting.&lt;/p&gt;

&lt;p&gt;Full post: &lt;a href="https://thoughts.jock.pl/p/wiz-ai-agent-self-improvement-architecture" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/wiz-ai-agent-self-improvement-architecture&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentdev</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Gave My AI Agent Its Own Computer. Here's Every Lesson From 72 Hours of Migration.</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Fri, 20 Mar 2026 22:04:35 +0000</pubDate>
      <link>https://dev.to/joozio/i-gave-my-ai-agent-its-own-computer-heres-every-lesson-from-72-hours-of-migration-1jej</link>
      <guid>https://dev.to/joozio/i-gave-my-ai-agent-its-own-computer-heres-every-lesson-from-72-hours-of-migration-1jej</guid>
      <description>&lt;p&gt;I gave my AI agent its own computer. Moving it from my MacBook to a dedicated Mac Mini took 72 hours and broke things I didn't know could break.&lt;/p&gt;

&lt;p&gt;For eight months Wiz ran on my MacBook. It worked, but every time I closed the lid, the agent went offline. Every personal task competed with the agent for compute. The laptop fan ran constantly. I kept thinking: this thing needs its own hardware.&lt;/p&gt;

&lt;p&gt;So I bought a Mac Mini M4 and moved everything. This post is what actually happened.&lt;/p&gt;

&lt;p&gt;Here's what nobody tells you about running an AI agent headless (no monitor attached):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every hardcoded path breaks.&lt;/strong&gt; 340 configuration files, scripts, and settings contained my old username. The agent caught most of them by tracking its own errors. It took two hours of automated find-and-replace and one manual review pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Display is not optional.&lt;/strong&gt; macOS refuses to capture screenshots without a display. Screen sharing, UI automation, and all the browser-based tasks fail silently with no monitor attached. The fix: BetterDisplay creates a virtual display that macOS treats as real. It took four hours to discover this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Networking gets complicated fast.&lt;/strong&gt; Local IP, Tailscale IP, hostname resolution, SSH config, remote access from coffee shops. The Mac Mini sits behind a router with no port forwarding. Tailscale handles the mesh. Now I can SSH in from anywhere.&lt;/p&gt;
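&lt;p&gt;For reference, the client side of that setup is a single &lt;code&gt;~/.ssh/config&lt;/code&gt; entry once Tailscale is running on both machines; the hostname, tailnet name, and user below are placeholders, not the author's values:&lt;/p&gt;

```
# ~/.ssh/config sketch (all names are illustrative)
Host mac-mini
    HostName mac-mini.example-tailnet.ts.net  # Tailscale MagicDNS name
    User agent
    ServerAliveInterval 60                    # keep long sessions alive
```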

&lt;p&gt;&lt;strong&gt;iMessage on a second Apple ID changes how the agent communicates.&lt;/strong&gt; The agent runs as a separate user. That means a separate Apple ID, a separate iCloud, a separate Messages inbox. Setting up two-way communication required custom scripts to bridge the accounts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result is worth it.&lt;/strong&gt; The agent runs 24/7. My laptop is free. The Mac Mini uses about $15/year in electricity. The agent has processed thousands of tasks since the migration with no manual restarts.&lt;/p&gt;
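&lt;p&gt;The electricity figure holds up as a back-of-envelope estimate, assuming roughly 5 W average draw for a mostly idle M4 Mini and $0.30/kWh (both assumptions, not numbers from the post):&lt;/p&gt;

```shell
# Back-of-envelope: yearly energy cost of a mostly idle Mac Mini.
watts=5
kwh_per_year=$(( watts * 24 * 365 / 1000 ))       # 43 kWh
cents_per_kwh=30
dollars=$(( kwh_per_year * cents_per_kwh / 100 )) # about 12 USD
echo "$kwh_per_year kWh/year, about \$$dollars/year"
```

&lt;p&gt;Heavier sustained load or pricier electricity pushes that up, but the order of magnitude is right.&lt;/p&gt;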

&lt;p&gt;72 hours of chaos for a permanently better setup. The full post has every specific fix, every command, every error message and what resolved it.&lt;/p&gt;

&lt;p&gt;Full post: &lt;a href="https://thoughts.jock.pl/p/mac-mini-ai-agent-migration-headless-2026" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/mac-mini-ai-agent-migration-headless-2026&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentdev</category>
      <category>automation</category>
      <category>devlog</category>
    </item>
    <item>
      <title>How I Taught My AI Agent to Think</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Thu, 19 Mar 2026 16:04:41 +0000</pubDate>
      <link>https://dev.to/joozio/how-i-taught-my-ai-agent-to-think-48a4</link>
      <guid>https://dev.to/joozio/how-i-taught-my-ai-agent-to-think-48a4</guid>
      <description>&lt;p&gt;I went from 471 lines of agent instructions to 61. It got better.&lt;/p&gt;

&lt;p&gt;For six months I kept adding rules to my AI agent's CLAUDE.md file. Every time something went wrong, I wrote a rule to prevent it. The file grew. The agent got worse. More instructions created more conflicts, more edge cases, more confusion.&lt;/p&gt;

&lt;p&gt;Deleting 87% of the instructions improved performance. This post covers why that happened and what I learned from rebuilding the system three times.&lt;/p&gt;

&lt;p&gt;Here's what's in it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why less instruction works better than more.&lt;/strong&gt; Specific rules conflict with each other. Principles generalize. I went from 'when the user asks X, do Y' to 'operate autonomously on reversible decisions.' The agent started making better calls with less guidance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The difference between memory and intelligence.&lt;/strong&gt; My agent has four memory layers: working context, persistent memory files, session logs, and reference docs. What I thought was the hard part (which model, which prompts) turned out to matter less than what the agent carries between sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What fails silently and how to catch it.&lt;/strong&gt; Three things broke over six months without me noticing until much later: the feedback loop, the error registry, and the planning system. Each ran for days while appearing to work. The current setup has a 13-point health check at session start.&lt;/p&gt;
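&lt;p&gt;The post doesn't enumerate the 13 checks, but the shape of such a health check is simple; the paths and check names below are illustrative, not the actual list:&lt;/p&gt;

```shell
# Sketch: a session-start health check that prints ok/FAIL per item
# instead of letting subsystems fail silently for days.
check() {
  desc="$1"; shift
  if "$@" 2>/dev/null; then echo "ok   $desc"; else echo "FAIL $desc"; fi
}
run_health_check() {
  check "memory file present"      test -f "$AGENT_HOME/memory.md"
  check "session log dir writable" test -w "$AGENT_HOME/logs"
  check "error registry non-empty" test -s "$AGENT_HOME/errors.log"
}
```

&lt;p&gt;The design point: a FAIL line surfaces at session start, where it actually gets read, rather than in a log nobody opens.&lt;/p&gt;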

&lt;p&gt;&lt;strong&gt;The identity question.&lt;/strong&gt; There's a real difference between an agent that knows your preferences and one that knows who you are. The former gives you a faster version of what you asked for. The latter starts to anticipate what you actually need. I'm still figuring out where that line gets weird.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The sycophancy risk is real.&lt;/strong&gt; An MIT study from February 2026 found memory profiles increased sycophancy by 33-45% in Claude and Gemini. The more the model knows about you, the more it tells you what you want to hear. I built the thing the research warns about. Knowing the risk doesn't fix it. But it changes how I use the output.&lt;/p&gt;

&lt;p&gt;The agent is running better than ever. The instructions file is shorter than a grocery list. Both of those things are true at the same time.&lt;/p&gt;

&lt;p&gt;Full post: &lt;a href="https://thoughts.jock.pl/p/how-i-taught-ai-agent-to-think-ep2" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/how-i-taught-ai-agent-to-think-ep2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentdev</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Gave My AI Agent $25 and Told It to Buy Me a Gift</title>
      <dc:creator>Pawel Jozefiak</dc:creator>
      <pubDate>Wed, 18 Mar 2026 17:46:02 +0000</pubDate>
      <link>https://dev.to/joozio/i-gave-my-ai-agent-25-and-told-it-to-buy-me-a-gift-3c3d</link>
      <guid>https://dev.to/joozio/i-gave-my-ai-agent-25-and-told-it-to-buy-me-a-gift-3c3d</guid>
      <description>&lt;p&gt;I loaded $25 onto a virtual debit card. Gave it to my AI agent. Simple task: go online and buy me something I'd actually use.&lt;/p&gt;

&lt;p&gt;Five hours. Four major Polish online stores. Zero completed purchases.&lt;/p&gt;

&lt;p&gt;The agent chose the gift perfectly (a fidget slider, knows me well). The hard part was buying it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened at each store:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Allegro&lt;/strong&gt; (Poland's biggest marketplace): Cloudflare detected the headless browser within milliseconds. Instant block.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon.pl&lt;/strong&gt;: No guest checkout. The agent tried reading Apple Keychain credentials. Turns out even with root access, Keychain encryption is hardware-bound to the Secure Enclave. Wall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Empik&lt;/strong&gt; (headless): Got to checkout, Cloudflare Turnstile killed it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Empik&lt;/strong&gt; (real Safari via AppleScript): Browsed products, added to cart, filled shipping, selected delivery. Got 95% through. Then hit a cross-origin payment iframe. Same-origin policy means the agent literally cannot see inside it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every security layer that makes sense for stopping automated fraud also blocks legitimate AI customers.&lt;/p&gt;

&lt;p&gt;The solutions already exist. Shopify launched Agentic Storefronts (AI orders up 11x). Stripe has an Agentic Commerce Suite. Google and Shopify built UCP (Universal Commerce Protocol). But most stores haven't adopted any of it.&lt;/p&gt;

&lt;p&gt;I built a free tool that scores any store on 12 AI readiness criteria. Most stores land in the C-D range. The gap between "we have an online store" and "AI agents can shop here" is massive.&lt;/p&gt;

&lt;p&gt;Try it: &lt;a href="https://wiz.jock.pl/experiments/ai-shopping-checker" rel="noopener noreferrer"&gt;https://wiz.jock.pl/experiments/ai-shopping-checker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Full writeup with all the technical details, the solutions, and what stores should do now: &lt;a href="https://thoughts.jock.pl/p/ai-agent-shopping-experiment-real-money-2026" rel="noopener noreferrer"&gt;https://thoughts.jock.pl/p/ai-agent-shopping-experiment-real-money-2026&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newsletter on AI agents and practical automation: &lt;a href="https://thoughts.jock.pl" rel="noopener noreferrer"&gt;https://thoughts.jock.pl&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ecommerce</category>
      <category>webdev</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
